March 31, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • add to documentation that text alias discovery matches whole words only (so the alias "mida" does not match inside "midair")
  • write up a GitHub issue for the apify developer, and ask whether documentation is forthcoming for apify v2
  • commit and document new system for URL expanding
  • send Alejandro another output set with better retweet URL extraction
  • add key to postprocessor so that Twitter API public metrics propagate through it
  • treat Twitter @mentions (whether retweets or other) as a citation
  • Alejandro: send Shengsong a list of Twitter accounts to test with
  • NYTimes crawl
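
As an illustration of the whole-word rule in the first agenda item, here is a minimal Python sketch. It is not the project's actual matching code; the function name `find_aliases` and the case-insensitive matching are assumptions made for the example.

```python
import re

def find_aliases(text, aliases):
    """Return the aliases that occur in text as whole words (hypothetical helper)."""
    found = []
    for alias in aliases:
        # Require a non-word character (or the string edge) on both sides,
        # so an alias never matches inside a longer word.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(alias)
    return found

print(find_aliases("The plane exploded in midair.", ["mida"]))  # []
print(find_aliases("A report from Mida today.", ["mida"]))      # ['mida']
```

Lookarounds are used instead of `\b` so that aliases beginning or ending with a non-word character (for example "@handle") are still bounded correctly.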

Twitter API crawling

  • question about plain text no longer saying "RT"? Might not be important
    • currently, plain text for a retweet shows the original tweet, but any comment the retweeter adds before the retweeted tweet does not appear
    • Shengsong will try including both retweet plain text and original tweet plain text and get back to us
  • add key to postprocessor so that Twitter API public metrics propagate through it - done
  • treat Twitter @mentions (whether retweets or other) as a citation - done
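
The two completed items above could look roughly like the following Python sketch. It assumes a tweet dictionary shaped like a Twitter API v2 tweet object (`entities.mentions[].username`, `public_metrics`); the function name and the output keys (`cited_account`, `source_tweet_id`) are hypothetical and do not reflect the actual postprocessor schema.

```python
def mentions_as_citations(tweet):
    """Build one citation record per @mention, carrying public metrics along.

    Assumes a Twitter API v2 style tweet dict; output keys are illustrative only.
    """
    mentions = tweet.get("entities", {}).get("mentions", [])
    metrics = tweet.get("public_metrics", {})
    return [
        {
            "cited_account": "@" + mention["username"],
            "source_tweet_id": tweet.get("id"),
            # Copy the metrics so each record can be modified independently.
            "public_metrics": dict(metrics),
        }
        for mention in mentions
    ]

# Made-up sample tweet for illustration.
sample = {
    "id": "1",
    "text": "RT @example_user: hello @another_user",
    "entities": {"mentions": [{"username": "example_user"},
                              {"username": "another_user"}]},
    "public_metrics": {"retweet_count": 3, "reply_count": 0,
                       "like_count": 5, "quote_count": 1},
}
citations = mentions_as_citations(sample)
print([c["cited_account"] for c in citations])  # ['@example_user', '@another_user']
```

Tweets without an `entities` or `public_metrics` key simply yield no citations or an empty metrics dictionary, so the sketch does not fail on sparse records.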

domain crawling

  • some URLs weren't being picked up - probably fixed
    • send the results after postprocessing the interim NYTimes crawl; we'll check for errors.
  • nytimes.com crawl:
    • crawled 130832 links and then stopped
    • question: could this be because NYT loads content on scroll?
    • Shengsong will look to see if puppeteer has a feature to address loading content on scroll
    • benchmarking: looks like the same speed as before, around 60,000 URLs a day

Action Items

  • look at retweet issue: how to get info for both the retweet and the original tweet, for plain text and entities
  • NYT crawl issue: look to see if puppeteer has a feature to address loading content on scroll
  • write up a GitHub issue for the apify developer, and ask whether documentation is forthcoming for apify v2
  • document new system for URL expanding

Backburner

  • what to do with htz.li
  • small domain crawl
  • Benchmarking
  • finish documenting where different data are on our server
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets