May 19, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- try slower crawl with single call procedure (see Crawl Strategies below)
- Alejandro: look at proxies for crawling: https://www.blackdown.org/best-datacenter-proxies/
- Monday meeting:
- finish documenting where different data are on our server
- question: adding text aliases and re-running scope?
- one example from KPP data about embedded tweets -- not urgent
- postprocessor refactoring -- to check back next week
Crawl Strategies
- insights from crawling small sites & The Guardian
- stealthy mode works well: no 403 errors
- using 2 threads also worked: no 403s, even on middleeasteye
- wait time of 3-4 seconds between requests
- still fetched ~100,000 pages per day while collecting body HTML
- the problem is that crawl speed can't be controlled directly:
- 1 thread with a 3-4 second wait time is very slow and runs into problems with slow page loads, e.g. loading video, which can take 1-3 minutes at times
- with 2 threads, the other thread keeps going while one is stuck
- using the single call to get body HTML doesn't noticeably slow down the crawler
- try crawling multiple domains with 2 threads each
- possibility of a brake?
- email in util crawl: send an email to the user and pause the crawler after receiving 10 x 403 or 10 x 429 errors
- worth re-crawling nytimes.com?
- probably not
- use proxies? - later
- start NYT Archive/keyword crawl on politics (sent by email) -- will set up
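The brake discussed above could be sketched roughly as follows. This is a minimal sketch, not the crawler's actual code: `notify` is a hypothetical callback standing in for whatever email helper the crawler uses, and the thresholds follow the 10 x 403 / 10 x 429 rule from the notes.

```python
from collections import Counter

class CrawlBrake:
    """Pause the crawl and alert the user after too many block-like errors.

    Sketch only: `notify` is a stand-in for the real email helper;
    thresholds follow the 10 x 403 / 10 x 429 rule discussed above.
    """

    WATCHED = {403, 429}   # status codes that suggest we are being blocked
    THRESHOLD = 10         # errors of one kind before we brake

    def __init__(self, notify):
        self.notify = notify   # callable(message) -> None, e.g. sends email
        self.counts = Counter()
        self.paused = False

    def record(self, status_code):
        """Call once per response; returns True if the crawl should pause."""
        if self.paused or status_code not in self.WATCHED:
            return self.paused
        self.counts[status_code] += 1
        if self.counts[status_code] >= self.THRESHOLD:
            self.paused = True
            self.notify(f"Crawler paused: {self.counts[status_code]} x "
                        f"{status_code} responses received")
        return self.paused
```

If multiple domains are crawled with 2 threads each, the counters would likely need to be kept per domain so one blocked site does not pause the others.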
Postprocessor Refactoring - start now?
- all methods and all code are in a single file
- really hard to find anything when debugging
- recommendation: divide into multiple parts: input handling, finding citations/refs, and others
- also: more object oriented approach
- wrt data structure: data frame operations will help -- load the full data frame, then operate on columns/rows
- start with a dask data frame, which has built-in functions: sorting, etc.
- common functions: load scope (small), load domain data (JSON -- currently looping and creating a dictionary), load Twitter data
- instead of a dictionary, create a data frame
Server
- finalized deletion of different old datasets - done
Twitter issue: embedded tweet
- not a hurry
Action Items:
- Alejandro: finish updating server documentation
- Alejandro: send sites to crawl
- design crawl brake (pause and email)
- test the new method above (2 threads, etc.) on a small list of domains
- main item is postprocessor refactoring along lines stated above
- NYT archive politics crawl
Backburner
- Twitter: embedded tweet issue
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
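The proposed backburner output additions could be sketched as a small filter over a page's hyperlinks. Assumptions are labeled in the code: `scope_domains` is taken to be a set of bare hostnames, and a "link to a tweet or twitter handle" is approximated as any twitter.com URL.

```python
from urllib.parse import urlparse

def flag_special_links(hyperlinks, scope_domains):
    """Partition hyperlinks into the two proposed backburner categories.

    Sketch under assumptions: `scope_domains` is a set of bare hostnames,
    and any twitter.com URL counts as a tweet/handle link.
    """
    co_il_links, twitter_links = [], []
    for link in hyperlinks:
        host = (urlparse(link).hostname or "").lower()
        if host in ("twitter.com", "www.twitter.com"):
            twitter_links.append(link)
        elif host.endswith(".co.il") and host not in scope_domains:
            co_il_links.append(link)
    return co_il_links, twitter_links
```

Matching on the parsed hostname rather than the raw string avoids false positives such as `.co.il` appearing in a URL path or query string.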