April 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- look at retweet issue: how to get info for both the retweet and the original tweet, for plain text and entities
- NYT crawl issue: look to see if puppeteer has a feature to address loading content on scroll
- write up GitHub issue for apify developer, and ask whether documentation is forthcoming for apify v2
- document new system for url expanding
NYT crawl issue
- NYT APIs don't return many articles
- another possibility is to look at the Web Archive to find NYTimes articles
- questions: following NYT links with scroller?
- still a problem with the paywall pop-up
- use "bypass paywall"?
- RSS feeds?
- attempt combination of the following techniques:
  - use NYTimes search with "load more" or similar strategies
  - bypass paywall or reader view function
  - crawl with scroll-down function
    - problem with puppeteer error after about 500
  - accumulate urls from internet archive of RSS feeds
- otherwise, attempt to use Internet Archive
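
A rough sketch of the "accumulate urls from internet archive of RSS feeds" idea, in Python with only the standard library. It is not the project's implementation. It assumes the Wayback Machine CDX endpoint and its `url`, `from`, `to`, `output`, `filter` and `collapse` query parameters, and an RSS 2.0 feed whose `<item>` elements carry a `<link>`. All function names, and the feed URL in the usage note, are placeholders chosen for illustration.

```python
import json
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: public Wayback Machine CDX search endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(feed_url, start, end):
    """Build a CDX query listing archived captures of one feed URL.

    start / end are timestamp prefixes such as "20220101".
    """
    params = {
        "url": feed_url,
        "output": "json",
        "from": start,
        "to": end,
        "filter": "statuscode:200",  # keep successful captures only
        "collapse": "digest",        # skip adjacent captures with identical content
    }
    return CDX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def snapshot_urls(cdx_rows):
    """Turn parsed CDX JSON (first row is the header) into snapshot URLs."""
    if not cdx_rows:
        return []
    header, *rows = cdx_rows
    ts = header.index("timestamp")
    orig = header.index("original")
    # The "id_" suffix asks for the capture as archived, without the Wayback toolbar.
    return [f"https://web.archive.org/web/{r[ts]}id_/{r[orig]}" for r in rows]


def article_links(rss_bytes):
    """Extract the <link> of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_bytes)
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def http_get(url):
    """Default fetcher: return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def accumulate_article_urls(feed_url, start, end, fetch=http_get):
    """Collect de-duplicated article URLs across all archived feed captures."""
    cdx_rows = json.loads(fetch(build_cdx_query(feed_url, start, end)))
    seen = set()
    ordered = []
    for snap in snapshot_urls(cdx_rows):
        for link in article_links(fetch(snap)):
            if link not in seen:
                seen.add(link)
                ordered.append(link)
    return ordered
```

Usage would look something like `accumulate_article_urls("https://rss.nytimes.com/services/xml/rss/nyt/World.xml", "20220101", "20220131")`, with the feed URL here only as an example. The `fetch` parameter is injectable so that rate limiting, retries or a local cache can be added without touching the parsing logic. The resulting list could then be handed to the crawler in place of links discovered by scrolling.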
apify update
Action items
- NYT crawl -- attempt combination of the following techniques:
  - use NYTimes search with "load more" or similar strategies
  - bypass paywall or reader view function
  - crawl with scroll-down function
    - problem with puppeteer error after about 500
  - accumulate urls from internet archive of RSS feeds
- retweet/tweet issue
- document new system for url expanding
Backburner
- what to do with htz.li
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- finding language function
- image_reference function
- dealing with embedded versus cited tweets