April 14, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

readability works now - Node.js was on an earlier version: we were using v14 and are now on v16
- enormous improvement in speed: at least 120,000 urls per day, perhaps 150,000
- with "load more" on NYT archive: it got to 40,000 in 6-7 hours
  - problem with puppeteer error after about 500:
  - solved by scrolling only 10 times and then switching to new search parameters
- with scroll down: 100,000 per day
  - NYT regular site: now working past the limitations of first regular crawl
  - 500,000+
- space for improvement even with "scroll down": probably can make even faster but need to research
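
One way to picture the "scroll only 10, then new search parameters" workaround is to split a long crawl into many small searches, so that no single page session gets near the point where the puppeteer error appeared. The sketch below is only an illustration under assumptions: it treats the search parameters as date ranges (the notes do not say what they are), reads "10" as ten scroll steps per search, and covers only the planning side, not the puppeteer crawl itself. The names `date_windows` and `MAX_SCROLLS_PER_SEARCH` are hypothetical.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple

# Assumption: "10" in the notes means ten scroll / "load more" steps per search.
MAX_SCROLLS_PER_SEARCH = 10


def date_windows(start: date, end: date, days_per_window: int) -> Iterator[Tuple[date, date]]:
    """Yield consecutive, non-overlapping (first_day, last_day) windows covering start..end inclusive."""
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    current = start
    while current <= end:
        # The last window is clipped so it never runs past `end`.
        window_end = min(current + timedelta(days=days_per_window - 1), end)
        yield current, window_end
        current = window_end + timedelta(days=1)


# Each window would become one search; the crawler would scroll at most
# MAX_SCROLLS_PER_SEARCH times and then move on to the next window.
windows = list(date_windows(date(2022, 1, 1), date(2022, 1, 10), 4))
for first_day, last_day in windows:
    print(first_day.isoformat(), "to", last_day.isoformat())
```

The window size would need tuning per site so that one window's results fit within the scroll budget.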
question: will we need to tweak for every site? or is there a standard domain crawler that will work for most sites, and then tweaks for others?
- probably for most sites, it will work fine
- document strategy: try to crawl most of the sites you have in scope, where it doesn't work, then decide whether that site is important, and then try to tweak
still need to accumulate urls from internet archive of RSS feeds?
- not necessary
looking ahead to visualizations: will want to do visualizations based on tags of sources
- not a problem to add

if not done, probably good idea to finalize this so as to recrawl the KPP/MediaCAT list and run postprocessor
hope to get to this tomorrow

to begin in May: assessment of any updates needed for libraries
adding to regular postprocessor output:
1. any non-scope domain hyperlink that ends in .co.il
2. any link to a tweet or twitter handle
- This is a bit outside our normal functionality, so I will put it on the backburner for now.
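
For whenever this comes off the backburner, the two rules above could be checked roughly as follows. This is a minimal sketch, not the postprocessor's actual code: the function names, the `scope_domains` argument, and the example URLs are all hypothetical, "ends in .co.il" is interpreted as the link's hostname ending in `.co.il`, and the list of reserved Twitter paths is a guess that is certainly incomplete.

```python
from typing import Optional, Set
from urllib.parse import urlparse

# Assumption: these hostnames cover the Twitter links we care about.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Assumption: non-exhaustive list of single-segment paths that are not handles.
RESERVED_PATHS = {"home", "search", "explore", "i", "intent", "share", "hashtag"}


def host_of(url: str) -> str:
    """Return the lower-cased hostname of a URL, or '' if it has none."""
    return (urlparse(url).hostname or "").lower()


def is_out_of_scope_co_il(url: str, scope_domains: Set[str]) -> bool:
    """True if the hostname ends in .co.il and is not an in-scope domain or a subdomain of one."""
    host = host_of(url)
    if not host.endswith(".co.il"):
        return False
    return not any(host == d or host.endswith("." + d) for d in scope_domains)


def classify_twitter_link(url: str) -> Optional[str]:
    """Return 'tweet' for a status link, 'handle' for a profile link, else None."""
    if host_of(url) not in TWITTER_HOSTS:
        return None
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) >= 3 and parts[1] == "status" and parts[2].isdigit():
        return "tweet"
    if len(parts) == 1 and parts[0].lower() not in RESERVED_PATHS:
        return "handle"
    return None
```

With a hypothetical scope of `{"example.co.il"}`, a link to `https://other.co.il/x` would be reported under rule 1, while `https://news.example.co.il/a` would not, since it is a subdomain of an in-scope domain.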
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server
finding language function
image_reference function
dealing with embedded versus cited tweets