March 17, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
new policy for url extender
URL extender issues: double shortening and retweets
finalize puppeteer update
finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
make postprocessor able to read twitter API output
Alejandro/RA checking output from postprocessor extraction of plain text
URL extender for Twitter API results
retweets: URL not included
used the request library to fetch the original URL from the original tweet
extender errors:
added code that resolves every working shortened URL to its original URL
a request has to be sent for every shortened URL, and another to get the original URL from a retweeted tweet, which slows the extender down
perhaps slow down the requests to ensure we aren't blocked
each request from puppeteer actually takes a few seconds
send support request to Twitter
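The double-shortening and throttling points above can be sketched roughly as follows. This is a minimal illustration, not the extender's actual code: it assumes Node 18+ and uses the built-in fetch in place of the request library mentioned above, and the names expandUrl, expandAll, fetchFn, maxHops and delayMs are all hypothetical. Redirects are followed one hop at a time so that a shortened URL pointing at another shortened URL is still resolved, and a pause follows every request to lower the chance of being blocked.

```javascript
// Hypothetical sketch: resolve shortened URLs hop by hop, with a pause per request.
// Assumes Node 18+ (global fetch); fetchFn can be swapped for another HTTP client.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Follow redirects manually until a non-redirect response or maxHops is reached.
// Returns the last URL reached. Some shorteners may not answer HEAD requests;
// switching the method to GET is a possible fallback (not handled here).
async function expandUrl(
  shortUrl,
  { fetchFn = globalThis.fetch, maxHops = 5, delayMs = 1000 } = {}
) {
  let current = shortUrl;
  for (let hop = 0; hop < maxHops; hop++) {
    const res = await fetchFn(current, { method: "HEAD", redirect: "manual" });
    await sleep(delayMs); // pause after every request
    const location = res.headers.get("location");
    const isRedirect = res.status >= 300 && res.status < 400 && Boolean(location);
    if (!isRedirect) {
      return current;
    }
    // Location may be relative, so resolve it against the current URL.
    current = new URL(location, current).href;
  }
  return current;
}

// Expand a list of URLs sequentially; a failed lookup is recorded as null
// so that one dead link does not stop the whole batch.
async function expandAll(urls, options = {}) {
  const results = new Map();
  for (const url of urls) {
    try {
      results.set(url, await expandUrl(url, options));
    } catch (err) {
      results.set(url, null);
    }
  }
  return results;
}

// Example call (performs real network requests, so it is left commented out):
// expandAll(["https://t.co/..."], { delayMs: 2000 }).then(console.log);
```

Running the lookups sequentially is slower than issuing them in parallel, but it keeps the request rate predictable; delayMs would need tuning against whatever limits the shortener services actually enforce.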
puppeteer update: from apify 1.3.4 to apify 1.3.6
apify v1 patch: apify uses puppeteer; documentation exists for v1 but not for v2
v2: tried it and ran into bugs
apify v1 is functioning fine; no benchmark yet
Shengsong will check for documentation on a weekly basis
finalized crawl of timelines KPP/MediaCAT
we have 3 files of less than 1 million rows with the entire corpus
postprocessor to read Twitter API output
not yet
Action Items
Shengsong to Alejandro: send a support request to the Twitter API team about retweets that come back without a URL, and about shortened URLs being returned as expanded URLs
postprocessor for Twitter API output
nytimes.com crawl benchmarking
Backburner
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server