May 26, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Alejandro: finish updating server documentation
- Alejandro: send sites to crawl
- design crawl brake (pause and email)
- test the new method above (2 threads, etc.) on a small list of domains
- main item is postprocessor refactoring, along the lines stated above
- NYT archive politics crawl
crawl brake (pause and email)
- Apify is not good at error handling: it has an error function, but it is only triggered under certain circumstances
- if a URL fails, it is added back to the queue, and the error only registers if the retry also fails
- Puppeteer error handling: crawl in rounds of a set size (e.g., 1,000 URLs) and register errors; if the count exceeds 50, pause the crawl
- Apify keeps a dataset of failed URLs; it appears in Apify storage, in the same place as the request queue
- Domain_crawler/guardian_2022_05_12/mediacat-domain-crawler/newCrawler/apify_storage/datasets
- added a brake: when the queue goes to 0, pause and send an email
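The brake described above can be sketched as a small helper. This is a hypothetical illustration, not the actual mediacat-domain-crawler code: the `CrawlBrake` class and `notify` callback are invented names; the round size of 1,000 and the 50-error threshold are the numbers from the notes.

```javascript
// Sketch of the crawl brake: count failures per round of ROUND_SIZE
// requests and pause once MAX_ERRORS is exceeded; also notify (e.g.,
// by email) when the request queue drains to 0.
const ROUND_SIZE = 1000; // requests per round (from the notes above)
const MAX_ERRORS = 50;   // failure threshold per round

class CrawlBrake {
  constructor(notify) {
    this.notify = notify; // e.g., a function that sends the email
    this.handled = 0;
    this.failed = 0;
    this.paused = false;
  }

  // Call after every request, successful or not.
  record(ok) {
    this.handled += 1;
    if (!ok) this.failed += 1;
    if (this.failed > MAX_ERRORS && !this.paused) {
      this.paused = true;
      this.notify(`Pausing: ${this.failed} failures in current round`);
    }
    if (this.handled % ROUND_SIZE === 0) {
      this.failed = 0; // new round: reset the failure counter
    }
    return this.paused;
  }

  // Call when the request queue is empty.
  queueEmpty() {
    this.notify('Request queue reached 0: crawl finished or stalled');
  }
}
```

In an Apify crawler, `record(false)` would presumably be called from the failed-request handler and `record(true)` from the page handler, but where exactly it plugs in depends on the crawler setup.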
Apify function to avoid URLs (e.g., videos)
- Shengsong tried it, but it doesn't seem to be working: it returns an "undefined" error
- pre-navigation: probably need a blacklist for each domain, but could look into it in the future
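One way the per-domain blacklist could look is a simple pattern table checked before navigation. This is a sketch under assumptions: the domains and regex patterns are illustrative, and `shouldSkip` is an invented helper, not part of the existing crawler.

```javascript
// Hypothetical per-domain URL blacklist. Patterns are checked against
// the URL path; the '*' entry applies to every domain as a fallback.
const BLACKLIST = {
  'example.com': [/\/video\//, /\.mp4$/],
  '*': [/\/videos?\//, /\.(mp4|avi|mov)$/],
};

function shouldSkip(url) {
  const { hostname, pathname } = new URL(url);
  const patterns = (BLACKLIST[hostname] || []).concat(BLACKLIST['*']);
  return patterns.some((re) => re.test(pathname));
}
```

In the Apify crawler this check could presumably run in a pre-navigation hook, skipping navigation for matching URLs, though the exact wiring would need to be worked out given the "undefined" error above.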
test new method on small domain list
- tested on 5 domains with the stealthy setting, 2 threads, and a 4-5 second delay; no block errors
- middleeasteye produced 2 million URLs
- test pre-navigation on middleeasteye at some point
- check in next week to see whether the 10 have finished
- Apify makes it possible to crawl in rounds
- crawl in rounds to reduce the pause time between calls to a given domain
- we can set the number of URLs taken from each domain per round, e.g., 500
- theoretically, with enough domains we wouldn't need a pause at all, but that won't be the case for most crawls
- document crawling in rounds
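The round idea above can be sketched as a batching helper: take up to a fixed number of URLs from each domain and interleave them, so consecutive requests hit different domains. `buildRound` and `PER_ROUND` are illustrative names, not actual crawler code; the 500 comes from the notes.

```javascript
// Sketch of crawling in rounds: pull up to PER_ROUND URLs per domain
// and round-robin across domains so the same domain is not hit twice
// in a row.
const PER_ROUND = 500; // URLs taken from each domain per round

function buildRound(queues) {
  // queues: { domain: [url, ...] }; each array is consumed as URLs are taken
  const taken = Object.values(queues).map((q) => q.splice(0, PER_ROUND));
  const round = [];
  const longest = Math.max(0, ...taken.map((t) => t.length));
  for (let i = 0; i < longest; i++) {
    for (const t of taken) {
      if (i < t.length) round.push(t[i]); // interleave across domains
    }
  }
  return round;
}
```

With enough domains in a round, the spacing between two requests to the same domain can exceed the usual politeness delay on its own, which is the "no pause needed" case mentioned above.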
postprocessor refactor
- split into input processing and output
- further divide Twitter & domain handling
- probably divided further after that
NYT archive politics crawl
- still running: about 200,000 of 800,000 finished
Action Items
- add documentation about:
- the round size and error threshold used to register errors and pause the crawler
- brake when queue goes to 0
- apify crawl in rounds
Backburner
- Twitter: embedded tweet issue
- Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets