June 28, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • send the stacked area charts and one-way vector diagrams from KPP data
  • troubleshooting errors on postprocessor discovered with KPP testing
  • update metascraper to include a db for handling errors and restarting after being stopped - use arbutus
  • dask multithreading for postprocessor - trouble-shooting why slower
  • consider D3 or similar for visualizing vector diagrams
  • Twitter: embedded tweet issue
  • when Graham is back on:
    • the guardian crawl: filter out comments urls
  • for next week: consider borealis to store datasets
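
For the Guardian crawl item above, a minimal sketch of how comment URLs could be filtered out of a list of crawled links. The host, path, and fragment patterns below are assumptions about what Guardian comment URLs look like and should be checked against the URLs that actually appear in the crawl output; `is_comment_url` and `filter_out_comments` are hypothetical names, not existing crawler functions.

```python
from urllib.parse import urlsplit

# ASSUMED patterns for Guardian comment URLs; verify against real crawl output.
COMMENT_HOSTS = {"discussion.theguardian.com"}
COMMENT_PATH_PREFIXES = ("/discussion/",)
COMMENT_FRAGMENT_PREFIXES = ("comment",)  # matches "#comments" and "#comment-123"


def is_comment_url(url: str) -> bool:
    """Return True if the URL looks like a comment page or comment permalink."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in COMMENT_HOSTS:
        return True
    if parts.path.lower().startswith(COMMENT_PATH_PREFIXES):
        return True
    return parts.fragment.lower().startswith(COMMENT_FRAGMENT_PREFIXES)


def filter_out_comments(urls):
    """Keep only the URLs that do not look like comment URLs."""
    return [u for u in urls if not is_comment_url(u)]
```

Keeping the patterns in module-level constants means they can be adjusted once real examples from the crawl are collected, without touching the filter logic.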

Visualizations

  • stacked area charts and vector diagrams?
  • making jupyter notebook work
  • some of the earlier work from Alice could be helpful here, by domain and url for vector diagrams
  • can use D3 for visualizations -- can use it in local server

Postprocessor:

  • update metascraper to include a db for handling errors and restarting after being stopped - use arbutus
    • seems to be working; Shengsong will test on Graham when it returns
  • dask multithreading for postprocessor - trouble-shooting why slower
    • not yet, need a large data set
  • errors with new postprocessor
    • certain errors seem to involve capital letters, and possibly a problem parsing the citation scope
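
The idea behind the metascraper db update (record errors, resume after being stopped) can be sketched roughly as below. This is not the actual metascraper change: it assumes a SQLite table as the db, and `open_state`, `run`, `summary`, the `url_state` table, and the `scrape` callable are all hypothetical names used only for illustration.

```python
import sqlite3


def open_state(path=":memory:"):
    """Open (or create) the state db; pass a file path to persist between runs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_state ("
        "url TEXT PRIMARY KEY, status TEXT NOT NULL, error TEXT)"
    )
    conn.commit()
    return conn


def run(conn, urls, scrape):
    """Scrape each URL once, skipping URLs already marked done in the db."""
    for url in urls:
        row = conn.execute(
            "SELECT status FROM url_state WHERE url = ?", (url,)
        ).fetchone()
        if row and row[0] == "done":
            continue  # finished in an earlier run, so a restart skips it
        try:
            scrape(url)  # hypothetical callable standing in for the real scraper
            status, error = "done", None
        except Exception as exc:
            # Record the failure instead of stopping the whole run
            status, error = "error", str(exc)
        conn.execute(
            "INSERT OR REPLACE INTO url_state (url, status, error) VALUES (?, ?, ?)",
            (url, status, error),
        )
        conn.commit()  # commit per URL so progress survives an abrupt stop


def summary(conn):
    """Return a {status: count} dict for a quick look at progress."""
    return dict(conn.execute("SELECT status, COUNT(*) FROM url_state GROUP BY status"))
```

In this sketch, URLs that failed are retried on the next run while completed ones are skipped; whether failed URLs should be retried automatically is a design choice to settle when testing on Graham.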

crawls:

  • assuming they are on hold

storage:

  • make a private repo on Github and use to store our datasets
  • Alejandro will make a spreadsheet listing the crawls and information about them

action items

  • visualizations: try D3 or another library for better visualizations
  • postprocessor:
    • test changes to metascraper
    • test changes with dask multithreading
    • finalize trouble-shooting of postprocessor differences on KPP data (capital letters, scope issue, etc.)
  • make a private repo on Github and use to store our datasets
  • Alejandro will make a spreadsheet listing the crawls and information about them
  • Twitter: embedded tweet issue

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
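
For the backburner item on adding to the regular postprocessor output, a rough sketch of how the two link types (non-scope `.co.il` hyperlinks, and links to tweets or twitter handles) might be detected. `classify_link`, the label strings, the list of Twitter hosts, and the reserved path names are assumptions made for illustration; the scope is assumed to be a set of bare domain names.

```python
from urllib.parse import urlsplit

# ASSUMED host list and reserved first path segments (pages that are not handles)
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
RESERVED = {"home", "search", "hashtag", "i", "intent", "share", "explore", "settings"}


def classify_link(url, scope_domains):
    """Return 'tweet', 'twitter_handle', 'co_il_out_of_scope', or None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]

    if host in TWITTER_HOSTS and segments and segments[0].lower() not in RESERVED:
        # e.g. twitter.com/<handle>/status/<id>
        if len(segments) >= 3 and segments[1].lower() == "status":
            return "tweet"
        # e.g. twitter.com/<handle>
        if len(segments) == 1:
            return "twitter_handle"

    # Strip a leading "www." before comparing against the scope
    bare = host[4:] if host.startswith("www.") else host
    if bare.endswith(".co.il") and bare not in scope_domains:
        return "co_il_out_of_scope"
    return None
```

For example, with a hypothetical scope of `{"haaretz.co.il"}`, a link to `https://www.ynet.co.il/...` would be labelled `co_il_out_of_scope`, while a link to `https://www.haaretz.co.il/...` would not be flagged. Subdomains other than `www.` are not matched against the scope in this sketch.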