May 26, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Alejandro: finish updating server documentation
  • Alejandro: send sites to crawl
  • design crawl brake (pause and email)
  • test the new method above (2 threads, etc.) on a small list of domains
  • main item is postprocessor refactoring along lines stated above
  • NYT archive politics crawl

crawl brake (pause and email)

  • Apify is weak at error handling: it has an error function, but it is only invoked under certain circumstances
    • if a URL fails, it is re-added to the queue, and the error only registers if the retry also fails
  • Puppeteer error handling: set a crawl round size (e.g., 1000 URLs) and register errors; if errors exceed 50, pause the crawl
  • Apify keeps a dataset of failed URLs; it appears in Apify storage, in the same place as the request queue
    • Domain_crawler/guardian_2022_05_12/mediacat-domain-crawler/newCrawler/apify_storage/datasets
  • added a brake: when the queue goes to 0, pause and send an email
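The brake logic above can be sketched as follows. This is a minimal illustration, not the actual mediacat-domain-crawler code; the class and parameter names are hypothetical, and the email sender is assumed to be injected (e.g., something like nodemailer in practice):

```javascript
// Sketch of the crawl brake: count failures per round of crawling; if
// failures exceed a threshold, or the request queue drains to 0, pause
// the crawl and notify by email. All names here are hypothetical.

const ROUND_SIZE = 1000;      // URLs per round (from the notes)
const ERROR_THRESHOLD = 50;   // pause if more errors than this in a round

class CrawlBrake {
  constructor(sendEmail) {
    this.sendEmail = sendEmail; // injected notifier function
    this.crawledInRound = 0;
    this.errorsInRound = 0;
    this.paused = false;
  }

  // Call after every request, successful or not.
  record({ failed, queueSize }) {
    this.crawledInRound += 1;
    if (failed) this.errorsInRound += 1;

    // Brake 1: too many errors within the current round.
    if (this.errorsInRound > ERROR_THRESHOLD) {
      this.pause(`more than ${ERROR_THRESHOLD} errors in one round`);
    }

    // Brake 2: the request queue drained to 0.
    if (queueSize === 0) {
      this.pause('request queue is empty');
    }

    // Start a fresh round once ROUND_SIZE URLs have been seen.
    if (this.crawledInRound >= ROUND_SIZE) {
      this.crawledInRound = 0;
      this.errorsInRound = 0;
    }
  }

  pause(reason) {
    if (this.paused) return; // notify only once
    this.paused = true;
    this.sendEmail(`Crawl paused: ${reason}`);
  }
}
```

In an Apify-based crawler, `record` would be called from the failed-request handler and after successful page handling, with the queue size read from the request queue.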

apify function to avoid URLs like videos

  • Shengsong tried it, but it doesn't seem to work; he gets an "undefined" error
  • pre-navigation: probably need a blacklist for each domain, but could look into it in the future
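The per-domain blacklist idea could look something like the sketch below. The helper name and patterns are hypothetical; in an Apify crawler this check would run inside a pre-navigation hook so that matching requests (e.g., video pages) are skipped before the browser navigates:

```javascript
// Sketch of per-domain URL filtering for pre-navigation. The blacklist
// contents and the helper name are assumptions for illustration.

// Hypothetical per-domain blacklist of URL patterns.
const BLACKLIST = {
  'example.com': [/\/video\//, /\/watch\?/],
};

// Returns true if the URL matches a blacklisted pattern for its domain.
function shouldSkip(url, blacklist = BLACKLIST) {
  const { hostname, pathname, search } = new URL(url);
  const domain = hostname.replace(/^www\./, '');
  const patterns = blacklist[domain] || [];
  return patterns.some((re) => re.test(pathname + search));
}
```

A hook would then abort or drop the request whenever `shouldSkip(request.url)` returns true, leaving other domains untouched.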

test new method on small domain list

  • tested on 5 domains with stealth mode, 2 threads, and a 4-5 second delay; no blocking errors
  • middleeasteye got 2 million URLs:
    • test pre-navigation on middleeasteye at some point
  • check in next week to see whether the 10 are finished
  • it is possible to crawl in rounds with Apify:
    • crawling in rounds reduces the pause time between calls to a given domain
    • we can set the number of URLs taken from each domain per round, e.g., 500
    • theoretically, with enough domains we wouldn't need a pause at all, but that won't hold for most crawls
    • document crawling in rounds

postprocessor refactor

  1. input processing and output
  2. further divide twitter & domain
  3. probably divide further after that

NYT archive politics crawl

  • still running: about 200,000 of 800,000 finished

Action Items

  • add documentation about:
    • crawler numbers for error registering and pausing the crawler
    • brake when queue goes to 0
    • apify crawl in rounds

Backburner

  • Twitter: embedded tweet issue
  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets