April 14, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
meeting times - possibly later or on another day?
Shengsong's last day for April?
CDHI conference
NYT crawl -- attempt combination of the following techniques:
use NYTimes search with "load more" or similar strategies
bypass paywall or reader view function
crawl with scroll-down function
problem with puppeteer error after about 500
accumulate urls from internet archive of RSS feeds
retweet/tweet issue
re-do KPP/MediaCAT
document new system for url expanding
NYT Crawl
readability works now - the issue was an older Node.js version: we were using v14 and are now on v16
enormous improvement in speed: at least 120,000 urls per day, perhaps 150,000
with "load more" on NYT archive: it got to 40,000 in 6-7 hours
problem with puppeteer error after about 500:
solved by scrolling only 10 times and then starting again with new search parameters
with scroll down: 100,000 per day
NYT regular site: now working past the limitations of first regular crawl
500,000+
space for improvement even with "scroll down": probably can make even faster but need to research
question: will we need to tweak for every site? or is there a standard domain crawler that will work for most sites, and then tweaks for others?
probably for most sites, it will work fine
document strategy: try to crawl most of the sites you have in scope, where it doesn't work, then decide whether that site is important, and then try to tweak
still need to accumulate urls from internet archive of RSS feeds?
not necessary
looking ahead to visualizations: will want to do visualizations based on tags of sources
not a problem to add
new Graham instance
500 GB RAM, 40 CPUs (up from 16), 1.2 TB storage
can do major crawl on this - NYT archive is crawling there
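A rough sketch of the "scroll only 10, then new search parameters" workaround from the NYT crawl notes above. The notes do not say which search parameters change between searches; this sketch assumes date windows. The names `date_windows`, `collect_urls`, and the `scroll_once` hook are hypothetical and stand in for the actual puppeteer code, which is not shown here.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List, Set, Tuple

# From the notes: scroll only 10 times, then start a new search.
MAX_SCROLLS_PER_SEARCH = 10


def date_windows(start: date, end: date, days: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping (window_start, window_end) date ranges."""
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


def collect_urls(
    start: date,
    end: date,
    days: int,
    scroll_once: Callable[[date, date, int], List[str]],
) -> Set[str]:
    """Collect URLs window by window, capping the scrolls done in each search.

    `scroll_once(window_start, window_end, scroll_index)` is a hypothetical hook
    (an assumption, not the project's API) that performs one scroll in the
    browser and returns the URLs revealed by that scroll.
    """
    seen: Set[str] = set()
    for window_start, window_end in date_windows(start, end, days):
        for scroll_index in range(MAX_SCROLLS_PER_SEARCH):
            urls = scroll_once(window_start, window_end, scroll_index)
            if not urls:
                break  # nothing more to load in this window
            seen.update(urls)
    return seen


def fake_scroll(window_start: date, window_end: date, scroll_index: int) -> List[str]:
    """Stand-in for the browser: each window reveals URLs on its first 3 scrolls."""
    if scroll_index >= 3:
        return []
    return [f"https://example.com/{window_start.isoformat()}/{scroll_index}"]
```

With the stand-in above, `collect_urls(date(2022, 4, 1), date(2022, 4, 14), 7, fake_scroll)` walks two 7-day windows and never issues more than 10 scrolls in either. If a window can hold more results than 10 scrolls reveal, the window size would need to shrink; that tuning is not covered here.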
retweet/tweet issue
if not done, probably a good idea to finalize this so as to recrawl the KPP/MediaCAT list and run the postprocessor
hope to get to this tomorrow
Action Items
finalize clean up, updating, and documentation of methods of NYT crawl
look at retweet/tweet issue
re-run KPP/MediaCAT twitter crawl
run small domain crawl with information from Alejandro
Alejandro: think through proposals for CDHI conference
Alejandro: find new time for weekly meeting
Backburner
to begin in May: assessment of any updates needed for libraries
adding to regular postprocessor output:
any non-scope domain hyperlink that ends in .co.il
any link to a tweet or twitter handle
This is a bit outside our normal functionality, so I will put it on the backburner for now.
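One possible shape for the two link checks listed above, as a minimal sketch only. The function name `classify_link`, the label strings, and the exact-match handling of scope domains are assumptions, not the postprocessor's actual behaviour.

```python
from typing import Optional, Set
from urllib.parse import urlparse

TWITTER_HOSTS = {"twitter.com", "mobile.twitter.com"}


def classify_link(url: str, scope_domains: Set[str]) -> Optional[str]:
    """Return a label for links the notes want added to the output, else None.

    Assumption: `scope_domains` holds bare host names without a leading "www.",
    and a host is in scope only on an exact match.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        # Tweet links look like /<handle>/status/<id>.
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        # A single path segment is treated as a handle; reserved pages such as
        # /home or /search would need an exclusion list (not handled here).
        if len(parts) == 1:
            return "twitter_handle"
        return None

    if host.endswith(".co.il") and host not in scope_domains:
        return "non_scope_co_il"
    return None
```

For example, with `scope_domains = {"example.co.il"}`, a link to `https://www.other.co.il/story` would be labelled `non_scope_co_il`, while a link to `https://www.example.co.il/story` would be skipped.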
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server