March 24, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Twitter API URL expander
support request to Twitter API team about retweets returned without their URLs, and about shortened URLs being returned as expanded URLs
postprocessor for Twitter API output - first results
nytimes.com crawl benchmarking
Twitter API URL expander & retweet URL extraction
Shengsong has developed an improved version of John's expander (using an extra step) -- need to commit and document
retweet problem:
Shengsong has developed a workaround, but we will wait for API support to respond to our support request before improving the workaround's accuracy (70% accurate right now)
Shengsong will send Alejandro another output set with better retweet URL extraction
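For reference while the support request is pending, here is a minimal sketch of one possible fallback for retweets that arrive without URLs. It is not Shengsong's workaround, which these notes do not describe. It assumes a Twitter API v2 response requested with the referenced-tweet expansion, so that original tweets appear under includes.tweets, and it prefers unwound_url, then expanded_url, over the t.co link. The function names tweet_urls and urls_with_retweet_fallback are hypothetical.

```python
from typing import Any


def tweet_urls(tweet: dict[str, Any]) -> list[str]:
    """Return the URLs attached to one tweet object, in their most expanded form."""
    urls = []
    for entity in tweet.get("entities", {}).get("urls", []):
        # Prefer the fully unwound URL, then the expanded one, then the t.co link.
        best = entity.get("unwound_url") or entity.get("expanded_url") or entity.get("url")
        if best:
            urls.append(best)
    return urls


def urls_with_retweet_fallback(payload: dict[str, Any]) -> dict[str, list[str]]:
    """Map each tweet id in payload["data"] to its URLs.

    If a tweet has no URLs of its own and is a retweet, fall back to the URLs
    of the original tweet found in payload["includes"]["tweets"].
    """
    originals = {t["id"]: t for t in payload.get("includes", {}).get("tweets", [])}
    result = {}
    for tweet in payload.get("data", []):
        urls = tweet_urls(tweet)
        if not urls:
            for ref in tweet.get("referenced_tweets", []):
                if ref.get("type") == "retweeted" and ref.get("id") in originals:
                    urls = tweet_urls(originals[ref["id"]])
                    break
        result[tweet["id"]] = urls
    return result
```

A retweet whose original is missing from includes.tweets still maps to an empty list, so those cases stay visible and can be counted when measuring accuracy.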
Postprocessor Twitter API Output
need to add name/associated publisher to crawl_scope in order to get it in postprocessed output
Shengsong will add a key for Twitter API public metrics to propagate to postprocessed output
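As an illustration of what propagating the metrics could look like, here is a minimal sketch. It assumes the Twitter API v2 tweet object carries a public_metrics field (retweet, reply, like and quote counts) and that a postprocessed record is a plain dictionary. The record layout, the output key name, and the function name propagate_public_metrics are all hypothetical and not taken from the postprocessor code.

```python
from typing import Any


def propagate_public_metrics(tweet: dict[str, Any], record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a postprocessed record with the tweet's public metrics attached.

    The "public_metrics" output key is an assumption; the real postprocessor
    may use a different name or flatten the counts into separate columns.
    """
    out = dict(record)
    # Copy the metrics so later changes to the tweet do not alter the record.
    out["public_metrics"] = dict(tweet.get("public_metrics", {}))
    return out
```

Tweets collected without the public_metrics field get an empty dictionary rather than a missing key, which keeps the output schema uniform.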
NYTimes.com crawl
waiting on error checking from RA
Action Items
write up GitHub issue for Apify developer, and ask whether documentation is forthcoming for Apify v2
commit and document new system for URL expanding
send Alejandro another output set with better retweet URL extraction
add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor
treat Twitter @mentions (whether retweets or other) as a citation
Alejandro: send Shengsong a list of Twitter accounts to test with
Backburner
what to do with htz.li
small domain crawl
Benchmarking
finish documenting where different data are on our server