May 19, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- try slower crawl with single call procedure (see Crawl Strategies below)
- Alejandro: look at proxies for crawling: https://www.blackdown.org/best-datacenter-proxies/
- Monday meeting:
- finish documenting where different data are on our server
- question: adding text aliases and re-running scope?
- one example from KPP data about embedded tweets -- not urgent
- postprocessor refactoring -- to check back next week
Crawl Strategies
- insights from crawling small sites & The Guardian
- stealthy mode works well: no 403 errors
- using 2 threads also worked: no 403s, even on middleeasteye
- wait time of 3-4 seconds between requests
- still fetched ~100,000 pages per day while collecting body HTML
- the problem is that crawl speed can't be controlled directly:
- 1 thread with a 3-4 second wait time is very slow and runs into problems with slow page loads, e.g. loading video, which can take 1-3 minutes at times
- with 2 threads, the other thread keeps going while one is stuck
- using the single call to get body HTML doesn't noticeably slow down the crawler
- try crawling multiple domains with 2 threads each
- possibility of a brake?
- email in util crawl: send an email to the user and pause the crawler after receiving 10 x 403 or 10 x 429 errors
- worth re-crawling nytimes.com?
- probably not
- use proxies? - later
- start NYT Archive/keyword crawl on politics (sent by email) -- will set up
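The brake discussed above could be sketched roughly as follows. This is a minimal sketch, not the crawler's actual code: `notify` is a hypothetical callback standing in for whatever email helper the crawler uses, and the thresholds follow the 10 x 403 / 10 x 429 rule from the notes.

```python
from collections import Counter

class CrawlBrake:
    """Pause the crawl and alert the user after too many block-like errors.

    Sketch only: `notify` is a stand-in for the real email helper;
    thresholds follow the 10 x 403 / 10 x 429 rule discussed above.
    """

    WATCHED = {403, 429}   # status codes that suggest we are being blocked
    THRESHOLD = 10         # errors of one kind before we brake

    def __init__(self, notify):
        self.notify = notify   # callable(message) -> None, e.g. sends email
        self.counts = Counter()
        self.paused = False

    def record(self, status_code):
        """Call once per response; returns True if the crawl should pause."""
        if self.paused or status_code not in self.WATCHED:
            return self.paused
        self.counts[status_code] += 1
        if self.counts[status_code] >= self.THRESHOLD:
            self.paused = True
            self.notify(f"Crawler paused: {self.counts[status_code]} x "
                        f"{status_code} responses received")
        return self.paused
```

If multiple domains are crawled with 2 threads each, the counters would likely need to be kept per domain so one blocked site does not pause the others.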
Postprocessor Refactoring - start now?
- all methods and all code are in a single file
- really hard to find anything when debugging
- recommendation: divide into multiple parts: input handling, finding citations/refs, and others
- also: more object oriented approach
- wrt data structure: data frame operations will help -- load the full data frame, then operate on columns/rows
- start with a dask data frame, which has built-in functions: sorting, etc.
- common functions: load scope (small), load domain data (JSON -- currently looping and creating a dictionary), load Twitter data
- instead of a dictionary, create a data frame
Server
- finalized deletion of different old datasets - done
Twitter issue: embedded tweet
- not a hurry
Action Items:
- Alejandro: finish updating server documentation
- Alejandro: send sites to crawl
- design crawl brake (pause and email)
- test the new method above (2 threads, etc.) on a small list of domains
- main item is postprocessor refactoring along lines stated above
- NYT archive politics crawl
Backburner
- Twitter: embedded tweet issue
- using crawler proxies
- adding to regular postprocessor output:
- any non-scope domain hyperlink that ends in .co.il
- any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
- what to do with htz.li
- finding language function
- image_reference function
- dealing with embedded versus cited tweets
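The proposed backburner output additions could be sketched as a small filter over a page's hyperlinks. Assumptions are labeled in the code: `scope_domains` is taken to be a set of bare hostnames, and a "link to a tweet or twitter handle" is approximated as any twitter.com URL.

```python
from urllib.parse import urlparse

def flag_special_links(hyperlinks, scope_domains):
    """Partition hyperlinks into the two proposed backburner categories.

    Sketch under assumptions: `scope_domains` is a set of bare hostnames,
    and any twitter.com URL counts as a tweet/handle link.
    """
    co_il_links, twitter_links = [], []
    for link in hyperlinks:
        host = (urlparse(link).hostname or "").lower()
        if host in ("twitter.com", "www.twitter.com"):
            twitter_links.append(link)
        elif host.endswith(".co.il") and host not in scope_domains:
            co_il_links.append(link)
    return co_il_links, twitter_links
```

Matching on the parsed hostname rather than the raw string avoids false positives such as `.co.il` appearing in a URL path or query string.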