May 26, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Alejandro: finish updating server documentation
  • Alejandro: send sites to crawl
  • design crawl brake (pause and email)
  • test the new method above (2 threads, etc.) on a small list of domains
  • main item is postprocessor refactoring along lines stated above
  • NYT archive politics crawl

crawl brake (pause and email)

  • Apify is weak at error handling: it has an error function, but it is only invoked under certain circumstances
    • if a URL fails, it is re-added to the queue, and the error only registers if the retry also fails
  • Puppeteer error handling: set a crawl round size (e.g., 1000 URLs) and register errors; if errors exceed 50, pause the crawl
  • Apify keeps a dataset of failed URLs; it appears in Apify storage, in the same place as the request queue
    • Domain_crawler/guardian_2022_05_12/mediacat-domain-crawler/newCrawler/apify_storage/datasets
  • added a brake: when the queue goes to 0, pause and send an email
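The brake logic above can be sketched as follows. This is a minimal illustration, not the actual mediacat-domain-crawler code; the class and parameter names are hypothetical, and the email sender is assumed to be injected (e.g., something like nodemailer in practice):

```javascript
// Sketch of the crawl brake: count failures per round of crawling; if
// failures exceed a threshold, or the request queue drains to 0, pause
// the crawl and notify by email. All names here are hypothetical.

const ROUND_SIZE = 1000;      // URLs per round (from the notes)
const ERROR_THRESHOLD = 50;   // pause if more errors than this in a round

class CrawlBrake {
  constructor(sendEmail) {
    this.sendEmail = sendEmail; // injected notifier function
    this.crawledInRound = 0;
    this.errorsInRound = 0;
    this.paused = false;
  }

  // Call after every request, successful or not.
  record({ failed, queueSize }) {
    this.crawledInRound += 1;
    if (failed) this.errorsInRound += 1;

    // Brake 1: too many errors within the current round.
    if (this.errorsInRound > ERROR_THRESHOLD) {
      this.pause(`more than ${ERROR_THRESHOLD} errors in one round`);
    }

    // Brake 2: the request queue drained to 0.
    if (queueSize === 0) {
      this.pause('request queue is empty');
    }

    // Start a fresh round once ROUND_SIZE URLs have been seen.
    if (this.crawledInRound >= ROUND_SIZE) {
      this.crawledInRound = 0;
      this.errorsInRound = 0;
    }
  }

  pause(reason) {
    if (this.paused) return; // notify only once
    this.paused = true;
    this.sendEmail(`Crawl paused: ${reason}`);
  }
}
```

In an Apify-based crawler, `record` would be called from the failed-request handler and after successful page handling, with the queue size read from the request queue.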

apify function to avoid URLs like videos

  • Shengsong tried it, but it doesn't seem to work; he gets an "undefined" error
  • pre-navigation: probably need a blacklist for each domain, but could look into it in the future
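The per-domain blacklist idea could look something like the sketch below. The helper name and patterns are hypothetical; in an Apify crawler this check would run inside a pre-navigation hook so that matching requests (e.g., video pages) are skipped before the browser navigates:

```javascript
// Sketch of per-domain URL filtering for pre-navigation. The blacklist
// contents and the helper name are assumptions for illustration.

// Hypothetical per-domain blacklist of URL patterns.
const BLACKLIST = {
  'example.com': [/\/video\//, /\/watch\?/],
};

// Returns true if the URL matches a blacklisted pattern for its domain.
function shouldSkip(url, blacklist = BLACKLIST) {
  const { hostname, pathname, search } = new URL(url);
  const domain = hostname.replace(/^www\./, '');
  const patterns = blacklist[domain] || [];
  return patterns.some((re) => re.test(pathname + search));
}
```

A hook would then abort or drop the request whenever `shouldSkip(request.url)` returns true, leaving other domains untouched.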

test new method on small domain list

  • tested on 5 domains with stealth mode, 2 threads, and a 4-5 second delay; no blocking errors
  • middleeasteye got 2 million URLs:
    • test pre-navigation on middleeasteye at some point
  • check in next week to see whether the 10 are finished
  • it is possible to crawl in rounds with Apify:
    • crawling in rounds reduces the pause time between calls to a given domain
    • we can set the number of URLs taken from each domain per round, e.g., 500
    • theoretically, with enough domains we wouldn't need a pause at all, but that won't hold for most crawls
    • document crawling in rounds

postprocessor refactor

  1. input processing and output
  2. further divide twitter & domain
  3. probably divide further after that

NYT archive politics crawl

  • still running: about 200,000 of 800,000 finished

Action Items

  • add documentation about:
    • crawler numbers for error registering and pausing the crawler
    • brake when queue goes to 0
    • apify crawl in rounds

Backburner

  • Twitter: embedded tweet issue
  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets