April 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

look at retweet issue: how to get plain text and entities for both the retweet and the original tweet
NYT crawl issue: look to see if puppeteer has a feature to address loading content on scroll
write up GitHub issue for apify developer, and ask whether documentation is forthcoming for apify v2
document new system for url expanding
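
For the retweet item above, a minimal sketch of pulling text and entities for both tweets. It assumes the tweets are stored as Twitter API v1.1-style JSON, where a retweet carries the original tweet under `retweeted_status`; the format the crawler actually saves may differ. The function names `tweet_text` and `split_retweet` are hypothetical.

```python
from typing import Any, Dict, Optional


def tweet_text(tweet: Dict[str, Any]) -> str:
    """Return the plain text, preferring `full_text` (extended mode) over `text`."""
    return tweet.get("full_text") or tweet.get("text", "")


def summarize(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields of interest: id, plain text, and entities."""
    return {
        "id": tweet.get("id_str"),
        "text": tweet_text(tweet),
        "entities": tweet.get("entities", {}),
    }


def split_retweet(tweet: Dict[str, Any]) -> Dict[str, Optional[Dict[str, Any]]]:
    """Return info for the retweet itself and, if present, the original tweet.

    ASSUMPTION: v1.1-style JSON, where `retweeted_status` holds the original.
    """
    original = tweet.get("retweeted_status")
    return {
        "retweet": summarize(tweet),
        "original": summarize(original) if original is not None else None,
    }
```

In v1.1 data the top-level text of a retweet starts with "RT @user:" and can be truncated, so the untruncated text and entities are usually the ones under `retweeted_status`.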

NYT APIs don't return many articles
another possibility is to look at Web Archive to find NYTimes articles
questions: following NYT links with scroller?
- still problem with paywall pop up
- use "bypass paywall"?
RSS feeds?
- available at web archive
attempt combination of the following techniques:
- use NYTimes search with "load more" or similar strategies
- bypass paywall or reader view function
- crawl with scroll-down function
  - problem with puppeteer error after about 500
- accumulate urls from internet archive of RSS feeds
otherwise, attempt to use Internet Archive
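
A rough sketch of the "accumulate urls from internet archive of RSS feeds" idea, not a tested crawler. It assumes the Wayback Machine CDX search endpoint is used to list snapshots of a feed, and that the feeds are plain RSS 2.0 with `<item><link>` elements. All function names here are hypothetical, and fetching is left out so the parsing can be checked offline.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List
from urllib.parse import urlencode

# Wayback Machine CDX search endpoint (lists archived snapshots of a URL).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(feed_url: str, start: str, end: str) -> str:
    """Build a CDX query listing snapshots of feed_url between two dates.

    start / end are timestamps such as "20220101" (yyyyMMdd).
    """
    params = {
        "url": feed_url,
        "output": "json",
        "from": start,
        "to": end,
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "digest",        # skip consecutive identical snapshots
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp: str, original: str) -> str:
    """URL of one archived copy; the `id_` suffix asks for the unmodified body."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def article_links(rss_xml: str) -> List[str]:
    """Extract the <link> of every <item> in one RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def accumulate_links(feeds: Iterable[str]) -> List[str]:
    """Merge links from many feed snapshots, dropping duplicates, keeping order."""
    seen = set()
    ordered = []
    for rss_xml in feeds:
        for link in article_links(rss_xml):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered
```

The intended flow would be: query `cdx_query_url(...)` for a feed, download each row's snapshot via `snapshot_url(...)`, and pass the downloaded XML strings to `accumulate_links(...)`. The resulting article URLs would still need the paywall or reader-view handling listed above.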