April 14, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • meeting times - possible later or another day?
  • Shengsong's last day for April?
  • CDHI conference
  • NYT crawl -- attempt combination of the following techniques:
    • use NYTimes search with "load more" or similar strategies
    • bypass paywall or reader view function
    • crawl with scroll-down function
      • problem with puppeteer error after about 500
    • accumulate urls from internet archive of RSS feeds
  • retweet/tweet issue
    • re-do KPP/MediaCAT
  • document new system for url expanding

NYT Crawl

  • readability works now - Node.js was on an earlier version: we were using v14 and are now on v16
    • enormous improvement in speed: at least 120,000 urls per day, perhaps 150,000
    • with "load more" on NYT archive: it got to 40,000 in 6-7 hours
      • problem with Puppeteer error after about 500
        • solved by scrolling only 10 times and then switching to new search parameters
    • with scroll down: 100,000 per day
      • NYT regular site: now working past the limitations of first regular crawl
      • 500,000+
    • space for improvement even with "scroll down": probably can make even faster but need to research
  • question: will we need to tweak for every site? or is there a standard domain crawler that will work for most sites, and then tweaks for others?
    • probably for most sites, it will work fine
    • document strategy: try to crawl most of the sites you have in scope, where it doesn't work, then decide whether that site is important, and then try to tweak
  • still need to accumulate urls from internet archive of RSS feeds?
    • not necessary
  • looking ahead to visualizations: will want to do visualizations based on tags of sources
    • not a problem to add
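
The "load more" workaround noted above (cap the scrolling at 10, then start over with new search parameters) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the crawler's actual code: it assumes the "new search parameters" are consecutive date ranges, which the notes do not specify, and `fetch_page` is a hypothetical callable standing in for whatever Puppeteer step returns the urls shown after a given number of "load more" clicks.

```python
from datetime import date, timedelta
from typing import Callable, Iterable, Iterator, List, NamedTuple

MAX_LOAD_MORE_CLICKS = 10  # cap taken from the notes ("scrolling only 10")


class SearchWindow(NamedTuple):
    start: date
    end: date  # inclusive


def search_windows(first: date, last: date, days_per_window: int) -> Iterator[SearchWindow]:
    """Yield consecutive, non-overlapping date windows covering first..last."""
    if days_per_window < 1:
        raise ValueError("days_per_window must be at least 1")
    start = first
    while start <= last:
        end = min(start + timedelta(days=days_per_window - 1), last)
        yield SearchWindow(start, end)
        start = end + timedelta(days=1)


def collect_urls(
    windows: Iterable[SearchWindow],
    fetch_page: Callable[[SearchWindow, int], List[str]],  # hypothetical fetcher
    max_clicks: int = MAX_LOAD_MORE_CLICKS,
) -> List[str]:
    """Run one fresh search per window and stop clicking at the cap."""
    seen = set()
    ordered: List[str] = []
    for window in windows:
        # Page 0 is the initial result list; pages 1..max_clicks are "load more" clicks.
        for page in range(max_clicks + 1):
            urls = fetch_page(window, page)
            if not urls:
                break  # this search is exhausted; move on to the next window
            for url in urls:
                if url not in seen:  # windows may return overlapping results
                    seen.add(url)
                    ordered.append(url)
    return ordered
```

The window size would need tuning per site: it should be small enough that one search is usually exhausted before the click cap is reached, otherwise results beyond the cap are silently skipped.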

new Graham instance

  • 500 GB ram, 40 CPU (from 16 CPU), 1.2 TB storage
  • can do major crawl on this - NYT archive is crawling there

retweet/tweet issue

  • if not done, it is probably a good idea to finalize this so we can recrawl the KPP/MediaCAT list and run the postprocessor
  • hope to get to this tomorrow

Action Items

  • finalize clean up, updating, and documentation of methods of NYT crawl
  • look at retweet/tweet issue
  • re-run KPP/MediaCAT twitter crawl
  • run small domain crawl with information from Alejandro
  • Alejandro: think through proposals for CDHI conference
  • Alejandro: find new time for weekly meeting

Backburner

  • to begin in May: assessment of any updates needed for libraries
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
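
For the backburner item on extra postprocessor output, the two link checks (non-scope .co.il hyperlinks, and links to a tweet or twitter handle) could look something like the sketch below. It is not part of the existing postprocessor. The function names, the twitter host list, and the list of non-handle paths are all assumptions made for illustration, and "ends in .co.il" is read here as "the link's hostname ends in .co.il".

```python
from typing import Iterable, Optional
from urllib.parse import urlsplit

# Assumed host list; other twitter front ends are not covered.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Non-exhaustive set of first path segments that are not user handles.
NON_HANDLE_PATHS = {"home", "search", "hashtag", "i", "intent", "share", "explore"}


def hostname(url: str) -> str:
    """Lower-cased hostname, or "" when the url has no scheme or host."""
    return (urlsplit(url).hostname or "").lower()


def is_out_of_scope_co_il(url: str, scope_domains: Iterable[str]) -> bool:
    """True when the host ends in .co.il and is not under any in-scope domain."""
    host = hostname(url)
    if not host.endswith(".co.il"):
        return False
    return not any(host == d or host.endswith("." + d) for d in scope_domains)


def twitter_link_kind(url: str) -> Optional[str]:
    """Return "tweet", "handle", or None for anything else."""
    if hostname(url) not in TWITTER_HOSTS:
        return None
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if len(parts) >= 3 and parts[1] == "status":
        return "tweet"  # e.g. /<handle>/status/<id>
    if len(parts) == 1 and parts[0].lower() not in NON_HANDLE_PATHS:
        return "handle"  # e.g. /<handle>
    return None
```

Shortened links such as htz.li would not be caught by either check unless they are expanded first.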