May 5, 2022
Agenda
send re-processed NYT Archive crawl results
document that there is a limit to the size of the dataset the postprocessor can handle, and that we cannot know this limit in advance
finalize post-processing of the regular NYT crawl, and review the earliest articles for invalid URLs
look at the smaller libraries used by the postprocessor to see if they need updating
group KPP/MediaCAT results into batches of 500,000 (see the chunking sketch after this list)
re-start the small domain crawl; if the error returns, Shengsong will troubleshoot
check why the postprocessor for the small domain crawl produced only 6,000 relevant hits
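
A minimal sketch of the batching idea discussed above, assuming the results live in a flat CSV file; the file names and use of pandas are assumptions, not part of the MediaCAT codebase:

```python
import pandas as pd

CHUNK_SIZE = 500_000                 # batch size agreed on in the meeting
INPUT_FILE = "mediacat_results.csv"  # hypothetical file name

# Stream the results in fixed-size batches instead of loading
# everything into memory at once.
for i, chunk in enumerate(pd.read_csv(INPUT_FILE, chunksize=CHUNK_SIZE)):
    chunk.to_csv(f"mediacat_results_part{i:03d}.csv", index=False)
```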
NYT Archive
sent; Alejandro is working through them
Regular NYT crawl
postprocessing the whole crawl produced only 60 rows; very few of the 900,000 URLs are from the Middle East section (a pre-filtering sketch follows this list)
roughly 90% of the URLs are useless
until we understand what is limiting the crawl, we may set this kind of NYT crawling aside for now
try cnn.com instead
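
One way to cut down the 90% of useless URLs before postprocessing is to pre-filter by section. A sketch under assumptions: the crawl output is one JSON record per line with a "url" field, and the NYT section appears in the URL path (e.g. /world/middleeast/); file names are hypothetical:

```python
import json

SECTION_MARKER = "/world/middleeast/"  # assumed NYT section path

# Hypothetical input: one crawl record per line, each with a "url" field.
with open("nyt_crawl.jsonl") as src, open("nyt_middleeast.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        # Keep only articles whose URL falls under the Middle East section.
        if SECTION_MARKER in record.get("url", ""):
            dst.write(line)
```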
Postprocessing limit issue
one possibility is to try something other than plain Python, e.g. https://dask.org/
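
A sketch of what the Dask route could look like, assuming the results are CSV files; the file pattern and "domain" column are hypothetical:

```python
import dask.dataframe as dd

# Dask reads the CSV lazily in partitions, so the full dataset never
# has to fit in memory the way a plain pandas DataFrame would.
df = dd.read_csv("mediacat_results_*.csv")  # hypothetical file pattern

# Example: count rows per domain; nothing runs until .compute().
hits_per_domain = df.groupby("domain").size().compute()
print(hits_per_domain)
```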
Small Domain Crawl
1.5 million URLs
hitting 403 Forbidden errors on mondoweiss and middleeasteye
the sites could be blocking certain requests
worth trying a different IP,
or Postman,
or a different user agent
headless browser: looks like a real user, but rapid requests can still trigger blocking
may need to time requests differently
need to consider best practices for not overwhelming websites (see the request-pacing sketch below)
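
A minimal sketch of the pacing and user-agent ideas above, assuming the crawler uses Python requests; the header contents, delays, and backoff factor are illustrative, not settled values:

```python
import time
import requests

# Hypothetical courtesy settings: a descriptive user agent plus a fixed
# delay between requests; real values would need tuning per site.
HEADERS = {"User-Agent": "MediaCAT research crawler (contact: team@example.org)"}
DELAY_SECONDS = 2.0

urls = [
    "https://mondoweiss.net/",
    "https://www.middleeasteye.net/",
]

with requests.Session() as session:
    session.headers.update(HEADERS)
    for url in urls:
        resp = session.get(url, timeout=30)
        if resp.status_code == 403:
            # Still blocked: back off for longer before continuing.
            time.sleep(DELAY_SECONDS * 5)
        else:
            print(url, resp.status_code, len(resp.content))
        time.sleep(DELAY_SECONDS)
```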