March 31, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

add to documentation that text alias discovery matches whole words only (so "mida" is not found inside "midair")
write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
commit and document new system for URL expanding
send Alejandro another output set with better retweet URL extraction
add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor
treat Twitter @mentions (whether retweets or other) as a citation
Alejandro: send Shengsong a list of Twitter accounts to test with
NYTimes crawl
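
For the whole-word alias item above, a minimal sketch of what whole-word matching could look like, assuming a regex-based matcher. `find_alias` is a hypothetical helper for illustration, not the actual MediaCAT code.

```python
import re


def find_alias(alias: str, text: str) -> list[str]:
    """Return case-insensitive whole-word matches of `alias` in `text`.

    Hypothetical helper: the name and signature are assumptions.
    """
    # The lookarounds require that the alias is not preceded or followed by
    # a word character, so "mida" is not found inside "midair".
    pattern = re.compile(
        r"(?<!\w)" + re.escape(alias) + r"(?!\w)",
        re.IGNORECASE,
    )
    return pattern.findall(text)
```

With this sketch, `find_alias("mida", "caught in midair")` returns an empty list, while `find_alias("mida", "Mida reported today")` returns the one matched word.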

question: plain text no longer says "RT"? Might not be important
- currently, for retweets the plain text shows the original tweet, but if someone adds a comment before the retweeted tweet, that comment will not appear
- Shengsong will try including both retweet plain text and original tweet plain text and get back to us
add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor - done
treat Twitter @mentions (whether retweets or other) as a citation - done
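
The two "done" items above could be sketched roughly as follows. This is an illustration only: the function name and the output keys (`plain_text`, `citations`) are assumptions, not the postprocessor's real schema. The `public_metrics` object and its counts (`retweet_count`, `reply_count`, `like_count`, `quote_count`) are as returned by the Twitter API v2.

```python
import re

# Twitter handles are 1-15 ASCII letters, digits, or underscores.
# The lookbehind skips e-mail-like strings such as "a@b.com".
MENTION_RE = re.compile(r"(?<![\w@])@(\w{1,15})(?!\w)", re.ASCII)


def postprocess_tweet(tweet: dict) -> dict:
    """Hypothetical postprocessing step for one tweet record."""
    text = tweet.get("text", "")

    # Treat every @mention (retweet or otherwise) as a citation,
    # lowercased and de-duplicated in order of first appearance.
    citations = []
    for handle in MENTION_RE.findall(text):
        handle = handle.lower()
        if handle not in citations:
            citations.append(handle)

    return {
        "id": tweet.get("id"),
        "plain_text": text,
        "citations": citations,
        # Pass the API's public metrics through unchanged.
        "public_metrics": tweet.get("public_metrics", {}),
    }
```

The metrics are copied as a whole object rather than field by field, so any counts the API adds later would propagate without a code change.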

some URLs weren't being picked up - probably fixed
- send after postprocessing the NYTimes interim crawl; we'll check for errors
nytimes.com crawl:
- crawled 130832 links and then stopped
- question: could this be because NYT loads content on scroll?
- Shengsong will look to see if Puppeteer has a feature to address loading content on scroll
- benchmarking: looks like the same speed as before, around 60,000 URLs a day

look at retweet issue: how to get info for both the retweet and the original tweet, for plain text and entities
NYT crawl issue: look to see if Puppeteer has a feature to address loading content on scroll
write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
document new system for URL expanding