March 31, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- add to documentation that text alias discovery matches whole words only (so you don't get "mida" for "midair")
- write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
- commit and document new system for url expanding
- send Alejandro another output set with better retweet URL extraction
- add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor
- treat Twitter @mentions (whether retweets or other) as a citation
- Alejandro: send Shengsong a list of Twitter accounts to test with
- NYTimes crawl
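
The whole-word alias matching noted above can be illustrated with a minimal sketch. This is not the project's actual implementation: the function name `find_aliases` and the case-insensitive comparison are assumptions made for illustration only.

```python
import re


def find_aliases(text, aliases):
    """Return the aliases that occur in `text` as whole words.

    Hypothetical helper: the name and the case-insensitive matching
    are assumptions, not taken from the MediaCAT code base.
    """
    found = []
    for alias in aliases:
        # (?<!\w) and (?!\w) require that the alias is not glued to
        # another word character on either side, so "mida" does not
        # match inside "midair". re.escape protects aliases that
        # contain punctuation.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(alias)
    return found


if __name__ == "__main__":
    print(find_aliases("Two planes collided in midair", ["mida"]))
    print(find_aliases("Mida published a report", ["mida"]))
```

The first call finds nothing because "mida" only appears as part of "midair"; the second finds the alias because it stands alone as a word.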
Twitter API crawling
- question: plain text no longer says "RT"? Might not be important
- currently, plain text shows the original tweet in retweets, but if someone adds a comment before the retweeted tweet, that comment will not appear
- Shengsong will try including both retweet plain text and original tweet plain text and get back to us
- add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor - done
- treat Twitter @mentions (whether retweets or other) as a citation - done
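
The three Twitter items above (keeping both retweet and original plain text, passing public metrics through, and treating @mentions as citations) could look roughly like the sketch below. It assumes Twitter API v2 tweet objects with the `referenced_tweets`, `entities.mentions`, and `public_metrics` fields, and original tweets supplied from the response's `includes.tweets` list. The function name `build_record` and the output keys are hypothetical, not the postprocessor's actual schema.

```python
def build_record(tweet, included_tweets=()):
    """Build a hypothetical postprocessor record from one API v2 tweet.

    `included_tweets` is assumed to be the `includes.tweets` list that
    the API returns when referenced tweets are expanded.
    """
    originals = {t["id"]: t for t in included_tweets}

    # Look up the text of the retweeted or quoted tweet, if available.
    original_text = None
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") in ("retweeted", "quoted") and ref.get("id") in originals:
            original_text = originals[ref["id"]].get("text")
            break

    # Every @mention counts as a citation, whether or not it is a retweet.
    citations = []
    for mention in tweet.get("entities", {}).get("mentions", []):
        handle = "@" + mention["username"]
        if handle not in citations:
            citations.append(handle)

    return {
        "plain_text": tweet.get("text", ""),        # text of the tweet itself
        "original_plain_text": original_text,       # text of the referenced tweet
        "citations": citations,
        # Copy the metrics unchanged so they propagate to the output.
        "public_metrics": dict(tweet.get("public_metrics", {})),
    }
```

Keeping the two text fields separate means a comment added before a retweeted tweet is not lost, and downstream code can decide which text to search for aliases.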
Domain crawling
- some URLs weren't being picked up - probably fixed
- send the NYTimes interim crawl after postprocessing; we'll check it for errors
- nytimes.com crawl:
  - crawled 130832 links and then stopped
  - question: could this be because NYT loads content on scroll?
  - Shengsong will look to see if puppeteer has a feature to address loading content on scroll
- benchmarking: looks like the same speed as before, around 60,000 urls a day
Action Items
- look at retweet issue: how to get info for both the retweet and the original tweet, for plain text and entities
- NYT crawl issue: look to see if puppeteer has a feature to address loading content on scroll
- write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
- document new system for url expanding
Backburner
- what to do with htz.li
- small domain crawl
- Benchmarking
- finish documenting where different data are on our server
- finding language function
- image_reference function
- dealing with embedded versus cited tweets