June 16, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda:

  • crawl/postprocessor updates
  • documentation and new repo for new postprocessor
  • adding twitter counts to data structure
  • Twitter: embedded tweet issue
  • testing new postprocessor on KPP, old NYT, and new NYT data to check for discrepancies
  • producing visualizations with KPP data

crawl/postprocessor updates

  • small domain crawler still running
  • NYT politics archive postprocessing: still ongoing as of yesterday, about halfway through
  • theguardian crawl: going fine, with some blocks caused by "comments" URLs, which Shengsong will try to filter out through pre-navigation and regex
  • NYT & KPP postprocessing:
    • KPP: about 200,000 processed, with good results; no issues, and the output matched earlier results
    • NYT Mid E archive: not tried yet
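
A minimal sketch of the regex side of the "comments" URL filter. The crawler itself runs on Apify; this is written in Python only to illustrate the idea. The pattern and the names `COMMENTS_RE`, `is_comments_url`, and `filter_urls` are assumptions, since the notes do not say what the blocked URLs actually look like.

```python
import re

# Assumed pattern: a "comment"/"comments" path segment or fragment.
# The real shape of the blocked URLs is not recorded in these notes.
COMMENTS_RE = re.compile(r"(/|#)comments?(/|\?|#|$)", re.IGNORECASE)


def is_comments_url(url: str) -> bool:
    """Return True if the URL looks like a comments page or comments anchor."""
    return COMMENTS_RE.search(url) is not None


def filter_urls(urls):
    """Drop comments URLs before they are queued for crawling."""
    return [u for u in urls if not is_comments_url(u)]
```

The pattern requires a delimiter after the word, so a path such as `/commentsection` is left alone; it would need tightening once real examples of the blocked URLs are collected.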

new postprocessor documentation and repo

  • metascraper updates:
    • if metascraper has errors, there is currently no way to know what they are
    • if the server stops, there is no way to know where we were, so we need to restart from the beginning rather than from where it stopped
    • proposed solution: use a db to record the data that has been finished, so a restart can continue from where it stopped; pandas could be used for storage
      • unlikely to have much effect on the speed of the metascraper
  • adding twitter counts to data structure
    • not yet
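
A rough sketch of the checkpoint idea proposed for metascraper above. It assumes a small SQLite table instead of pandas, purely to keep the example self-contained; the table layout and the names `open_checkpoint`, `process_all`, and `scrape` are hypothetical and not part of the existing code.

```python
import sqlite3


def open_checkpoint(path=":memory:"):
    """Open (or create) the progress table; pass a file path to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    return conn


def process_all(conn, urls, scrape):
    """Run scrape(url) for every URL not already finished successfully."""
    done = {
        row[0]
        for row in conn.execute("SELECT url FROM progress WHERE status = 'ok'")
    }
    for url in urls:
        if url in done:
            continue  # finished in an earlier run, skip on restart
        try:
            scrape(url)
            status, error = "ok", None
        except Exception as exc:
            # Record the error so it can be inspected later.
            status, error = "error", repr(exc)
        conn.execute(
            "INSERT OR REPLACE INTO progress (url, status, error) VALUES (?, ?, ?)",
            (url, status, error),
        )
        conn.commit()  # commit per item so a crash loses at most one URL
```

With a file path instead of `:memory:`, a restarted run skips everything marked `ok` and retries only the failures, and the `error` column shows what went wrong. The per-item commit is one small write per URL, which fits the expectation that speed is not much affected.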

embedded tweet issue

  • need more examples

visualizations with KPP data

  • stacked area diagram, also node vector

Action Items:

  • theguardian crawl: filter out "comments" URLs
  • NYT Mid E archive: test on new postprocessor
  • postprocessor: adding twitter counts to data structure
  • metascraper: add a db so that errors are recorded and a stopped run can restart from where it left off
  • visualizations: figure out jupyter
  • Alejandro: need more examples of embedded tweet issue, and send list of visualizations

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
  • dealing with embedded versus cited tweets
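
If the backburner item on extra postprocessor output is picked up, the two link rules could be sketched roughly as below. The function name `classify_link`, the returned labels, and the `scope_domains` argument are hypothetical; how the postprocessor actually represents scope is not covered in these notes.

```python
from urllib.parse import urlparse

TWITTER_HOSTS = ("twitter.com", "mobile.twitter.com")


def classify_link(url, scope_domains):
    """Label a hyperlink as a tweet, a twitter handle, an out-of-scope
    .co.il domain, or None if neither rule applies."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        # e.g. /<handle>/status/<id>
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"
        # A single path segment is treated as a handle; reserved paths
        # such as /home or /search would need an exclusion list.
        if len(parts) == 1:
            return "twitter_handle"
        return None

    if host.endswith(".co.il") and host not in scope_domains:
        return "co_il_out_of_scope"
    return None
```

For example, `classify_link("https://www.example.co.il/page", {"other.co.il"})` returns `"co_il_out_of_scope"`, while the same URL with `example.co.il` in scope returns `None`.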