April 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • look at retweet issue: how to get plain text and entities for both the retweet and the original tweet
  • NYT crawl issue: look to see if Puppeteer has a feature to address loading content on scroll
  • write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
  • document new system for URL expansion

NYT crawl issue

  • NYT APIs don't return many articles
  • another possibility is to search the Web Archive for NYTimes content
  • questions: following NYT links with scroller?
    • still problem with paywall pop up
    • use "bypass paywall"?
  • RSS feeds?
    • available at web archive
  • attempt combination of the following techniques:
    • use NYTimes search with "load more" or similar strategies
    • bypass paywall or reader view function
    • crawl with scroll-down function
      • problem with Puppeteer error after about 500
    • accumulate URLs from Internet Archive snapshots of RSS feeds
  • otherwise, attempt to use Internet Archive
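
A minimal sketch of the "accumulate URLs from Internet Archive of RSS feeds" idea, in Python. It assumes the feeds are plain RSS 2.0 (each item has a link element) and that captures are listed through the Wayback Machine CDX endpoint. The function names are hypothetical and not part of the MediaCAT code; fetching is left out so the parsing and de-duplication can be checked offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode, urlsplit, urlunsplit

# Wayback Machine CDX search endpoint (lists archived captures of a URL).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(feed_url, start, end):
    """Build a CDX query for captures of feed_url between start and end (YYYYMMDD)."""
    params = {
        "url": feed_url,
        "from": start,
        "to": end,
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "digest",        # skip consecutive identical captures
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_url(timestamp, feed_url):
    """URL of one archived capture; the id_ suffix asks for the unmodified original."""
    return f"https://web.archive.org/web/{timestamp}id_/{feed_url}"


def strip_query(url):
    """Drop query string and fragment so tracking parameters do not create duplicates."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def article_links(rss_xml):
    """Return the article links found in one RSS document, in feed order."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(strip_query(link.strip()))
    return links


def accumulate_links(rss_documents):
    """Merge links from many archived feed snapshots, keeping first-seen order."""
    seen = set()
    ordered = []
    for rss_xml in rss_documents:
        for link in article_links(rss_xml):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered
```

In use, one would request the URL from cdx_query_url for a feed, read the timestamp column of each returned row, download each snapshot_url, and pass the response bodies to accumulate_links. The resulting list could then be handed to the crawler as start URLs, which would avoid the scroll-down step for articles that appeared in a feed.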

apify update

  • completed update to Apify v2.2

Action items

  • NYT crawl -- attempt combination of the following techniques:
    • use NYTimes search with "load more" or similar strategies
    • bypass paywall or reader view function
    • crawl with scroll-down function
      • problem with Puppeteer error after about 500
    • accumulate URLs from Internet Archive snapshots of RSS feeds
  • retweet/tweet issue
    • re-do KPP/MediaCAT
  • document new system for URL expansion

Backburner

  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets