June 23, 2022 - UTMediaCAT/mediacat-docs Wiki

Agenda:

  • the guardian crawl: filter out comments urls
  • NYT Mid E archive: test on new postprocessor
  • postprocessor: adding twitter counts to data structure
  • update metascraper to include a db for handling errors and for restarting after being stopped
  • visualizations: figure out jupyter
  • Alejandro: need more examples of embedded tweet issue, and send list of visualizations

Crawls Update:

  • small domain: same, pause
  • the guardian: same, pause
  • NYT Politics Archive postprocessing: not yet started

Postprocessor:

  • Twitter: embedded tweet issue:
    • still working on it
  • NYT Mid E archive: test on new postprocessor
  • KPP postprocessor: missing a few thousand, maybe small code error, troubleshooting
  • postprocessor: adding twitter counts to data structure
    • done
  • update metascraper to include a db for handling errors and for restarting after being stopped
    • Shengsong will work on this in Arbutus while we await Graham to return
  • tried adding multithreading (Dask-supported) to the postprocessor, but it ended up slower; why?
    • Nat will look at the video of the postprocessor changes
  • the Dask method for looping through the data was not very fast; a plain for-loop is faster
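
The "db to deal with errors and re-starting" item could look roughly like the sketch below. This is only an illustration in Python using the standard-library `sqlite3` module, not the metascraper code itself; the table name `jobs`, the status values, and the function names `open_store`, `add_urls`, `run`, and `counts` are all hypothetical. The idea is that each URL's status is committed as soon as it is processed, so a stopped run can pick up where it left off and failed URLs are recorded rather than halting the job.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the status table; pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    return conn


def add_urls(conn, urls):
    """Queue URLs as pending; URLs already recorded keep their existing status."""
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (url, status) VALUES (?, 'pending')",
        [(u,) for u in urls],
    )
    conn.commit()


def run(conn, scrape):
    """Process every URL that is not done yet, recording errors instead of stopping."""
    rows = conn.execute(
        "SELECT url FROM jobs WHERE status != 'done' ORDER BY url"
    ).fetchall()
    for (url,) in rows:
        try:
            scrape(url)  # hypothetical callable that does the real work
            conn.execute(
                "UPDATE jobs SET status = 'done', error = NULL WHERE url = ?",
                (url,),
            )
        except Exception as exc:
            conn.execute(
                "UPDATE jobs SET status = 'failed', error = ? WHERE url = ?",
                (str(exc), url),
            )
        conn.commit()  # commit per URL so an interrupted run loses nothing


def counts(conn):
    """Return a {status: number_of_urls} summary."""
    return dict(conn.execute("SELECT status, COUNT(*) FROM jobs GROUP BY status"))
```

Calling `run` a second time only touches URLs still marked pending or failed, which covers both the retry and the restart cases.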

Visualizations

  • making jupyter notebook work
  • some of the earlier work from Alice could be helpful here, by domain and url for vector diagrams
  • Shengsong will send the stacked area charts and one way vector diagrams
  • can use D3 for visualizations; it can run on a local server

Action Items

  • send the stacked area charts and one way vector diagrams from KPP data
  • troubleshooting errors on postprocessor discovered with KPP testing
  • update metascraper to include a db for handling errors and for restarting after being stopped (use Arbutus)
  • Dask multithreading for postprocessor: troubleshoot why it is slower
  • consider D3 or similar for visualizing vector diagrams
  • Twitter: embedded tweet issue: continue working on it
  • when Graham is back: filter out comments urls from the Guardian crawl
  • for next week: consider Borealis to store datasets
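
For the Guardian comments filter, a minimal sketch in Python (not the crawler's own code) might look like the following. The two patterns are assumptions about what comment URLs look like and have not been checked against the crawl output; the names `COMMENT_PATTERNS`, `is_comment_url`, and `filter_comment_urls` are hypothetical.

```python
import re

# Assumed patterns only; replace with whatever the crawl output actually shows.
COMMENT_PATTERNS = [
    re.compile(r"^https?://discussion\.theguardian\.com/"),  # assumed comments host
    re.compile(r"#comment(s|-\d+)?$"),                       # assumed comment anchors
]


def is_comment_url(url, patterns=COMMENT_PATTERNS):
    """Return True if the URL matches any of the comment patterns."""
    return any(p.search(url) for p in patterns)


def filter_comment_urls(urls, patterns=COMMENT_PATTERNS):
    """Return the URLs that do not look like comment pages, in their original order."""
    return [u for u in urls if not is_comment_url(u, patterns)]
```

Keeping the patterns in one list means they can be adjusted once real examples of comments urls are collected.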

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
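
If the backburner item on extra postprocessor output is picked up, the link check could be sketched as below. This is a Python illustration under stated assumptions, not existing postprocessor code: the function name `classify_link`, the returned labels, and the `TWITTER_HOSTS` set are hypothetical, and the handle test is deliberately rough (any single-segment twitter.com path counts as a handle).

```python
from urllib.parse import urlparse

# Assumed set of Twitter hostnames; extend as needed.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def classify_link(url, scope_domains):
    """Label a hyperlink as 'tweet', 'twitter_handle', 'co_il', or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    if host in TWITTER_HOSTS:
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 3 and parts[1] == "status":
            return "tweet"  # e.g. /<handle>/status/<id>
        if len(parts) == 1:
            return "twitter_handle"  # rough: also matches pages like /home
        return None

    # A link is in scope if its host is a scope domain or a subdomain of one.
    in_scope = any(host == d or host.endswith("." + d) for d in scope_domains)
    if not in_scope and host.endswith(".co.il"):
        return "co_il"
    return None
```

Links that return `None` would be left to the normal in-scope handling.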