March 24, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Twitter API URL expander
  • support request to the Twitter API team about retweets returned without their URLs, and about shortened URLs being returned as expanded URLs
  • postprocessor for Twitter API output - first results
  • nytimes.com crawl benchmarking

Twitter API URL expander & retweet URL extraction

  • Shengsong has developed an improved version of John's expander (it uses an extra step); it still needs to be committed and documented
  • retweet problem:
    • Shengsong has developed a workaround (about 70% accurate right now), but we will wait for API support to respond to our support request before working on its accuracy
    • Shengsong will send Alejandro another output set with better retweet URL extraction
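
To make the retweet problem concrete, here is a minimal sketch of one way to recover URLs for retweets from a Twitter API v2 response. It is not Shengsong's workaround. It assumes the request asked for the `entities` and `referenced_tweets` fields and expanded the referenced tweets, so that the original tweet is available from `includes.tweets`. The function names and the `SHORTENER_HOSTS` set are hypothetical, chosen only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical set of shortener hosts; the real list would need to be curated.
SHORTENER_HOSTS = {"t.co", "bit.ly", "htz.li"}


def urls_from_entities(tweet):
    """Return the best available URL for each entry in tweet['entities']['urls']."""
    urls = []
    for entry in (tweet.get("entities") or {}).get("urls", []):
        # Prefer the fully unwound URL, then the expanded one, then the raw link.
        best = entry.get("unwound_url") or entry.get("expanded_url") or entry.get("url")
        if best:
            urls.append(best)
    return urls


def extract_urls(tweet, included_tweets):
    """Collect URLs from a tweet, also looking at the retweeted original.

    included_tweets is assumed to map tweet id -> tweet object, built from
    the 'includes.tweets' part of the API response.
    """
    urls = urls_from_entities(tweet)
    for ref in tweet.get("referenced_tweets", []):
        if ref.get("type") == "retweeted":
            original = included_tweets.get(ref.get("id"))
            if original:
                for url in urls_from_entities(original):
                    if url not in urls:
                        urls.append(url)
    return urls


def still_shortened(url):
    """Flag URLs whose host is a known shortener and so needs a further expansion step."""
    return urlparse(url).netloc.lower() in SHORTENER_HOSTS
```

URLs flagged by `still_shortened` would be the ones to pass to a separate expansion step, which is left out here because it needs network access.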

Postprocessor Twitter API Output

  • need to add the name/associated publisher to crawl_scope so that it appears in the postprocessed output
  • Shengsong will add a key for Twitter API public metrics to propagate to postprocessed output
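
As a rough illustration of the two points above, the sketch below copies the Twitter API v2 `public_metrics` object into a postprocessed record and looks up the name and associated publisher from the crawl scope. The layout of `crawl_scope`, the `author_handle` field, and the output keys are all assumptions made for the example; the actual postprocessor may be organized differently.

```python
def postprocess_tweet(tweet, crawl_scope):
    """Build a postprocessed record that carries metrics and publisher info.

    crawl_scope is assumed (hypothetically) to look like:
        {twitter_handle: {"name": ..., "publisher": ...}}
    """
    # 'author_handle' is a hypothetical field holding the account's handle.
    scope_entry = crawl_scope.get(tweet.get("author_handle", ""), {})
    return {
        "id": tweet.get("id"),
        "text": tweet.get("text", ""),
        "name": scope_entry.get("name"),
        "associated_publisher": scope_entry.get("publisher"),
        # Copy the metrics so later edits to the output do not touch the raw record.
        "public_metrics": dict(tweet.get("public_metrics") or {}),
    }
```

A tweet with no `public_metrics` field yields an empty dictionary under that key, so downstream code can rely on the key being present.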

NYTimes.com crawl

  • waiting on error checking from RA

Action Items

  • write up a GitHub issue for the Apify developer, and ask whether documentation is forthcoming for Apify v2
  • commit and document the new system for URL expansion
  • send Alejandro another output set with better retweet URL extraction
  • add key to postprocessor to allow Twitter API public metrics to propagate through postprocessor
  • treat Twitter @mentions (whether retweets or other) as a citation
  • Alejandro: send Shengsong a list of Twitter accounts to test with

Backburner

  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets