March 17, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • new policy for url extender
    • URL extender issues: double shortening and retweets
  • finalize puppeteer update
  • finalize crawl of timelines for KPP/MediaCAT: 60,001+ tweets
  • make postprocessor able to read twitter API output
  • Alejandro/RA checking output from postprocessor extraction of plain text

URL extender for Twitter API results

  • retweets: URL not included
    • used the request library to fetch the original URL from the original tweet
  • extender errors:
    • added code that follows every working shortened URL to find the original URL
  • a request has to be sent for every shortened URL, and another to get the original URL from each retweeted tweet, which slows the extender down
    • perhaps slow down the requests to ensure we aren't blocked
    • each request from puppeteer actually takes a few seconds
  • send support request to Twitter
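
A minimal sketch of the redirect-following idea noted above, written in Python for illustration and not taken from the project's code. The names expand_url, http_fetch_location, max_hops, and delay are hypothetical. It assumes each shortener answers with an HTTP 3xx response and a Location header; it follows one hop at a time (so double-shortened links are resolved) and pauses between requests to lower the chance of being blocked.

```python
import time
import urllib.error
import urllib.request
from urllib.parse import urljoin


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # the 3xx response then surfaces as an HTTPError


def http_fetch_location(url, timeout=10.0):
    """Return the Location header of a redirect response, or None."""
    opener = urllib.request.build_opener(_NoRedirect)
    req = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(req, timeout=timeout):
            return None  # no redirect: url is already the final address
    except urllib.error.HTTPError as err:
        if err.code in (301, 302, 303, 307, 308):
            return err.headers.get("Location")
        raise


def expand_url(url, fetch_location=http_fetch_location, max_hops=5, delay=1.0):
    """Follow up to max_hops redirects, sleeping `delay` seconds between hops."""
    seen = {url}
    for _ in range(max_hops):
        target = fetch_location(url)
        if not target:
            return url  # no further redirect
        target = urljoin(url, target)  # Location may be relative
        if target in seen:
            return url  # redirect loop: stop at the current address
        seen.add(target)
        url = target
        time.sleep(delay)  # throttle before the next request
    return url
```

The fetcher is passed in as a parameter so the chain-following logic can be checked offline with a stub, for example a dictionary mapping each short link to the next address.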

puppeteer update: from apify 1.3.4 to apify 1.3.6

  • apify v1 patch: apify uses puppeteer; there is documentation for v1 but not for v2
    • v2: tried it and it produced bugs
  • apify v1 is functioning fine; no benchmark yet
  • Shengsong will check for documentation on a weekly basis

finalized crawl of timelines KPP/MediaCAT

  • the entire corpus is in 3 files, each with fewer than 1 million rows

postprocessor to read Twitter API output

  • not yet

Action Items

  • Shengsong to Alejandro: send support request to Twitter API team about retweets that do not include the URL, and also about shortened URLs being returned as expanded
  • postprocessor for Twitter API output
  • nytimes.com crawl benchmarking

Backburner

  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets