July 5, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • visualizations: try D3 or another library for better visualizations
  • postprocessor:
    • test changes to metascraper
    • test changes with dask multithreading
    • finalize troubleshooting of the postprocessor output differences on KPP data (capital letters, scope issue, etc.)
  • make a private repo on GitHub and use it to store our datasets
  • Alejandro will make a spreadsheet listing the crawls, with information about each
  • Twitter: embedded tweet issue

Visualization

  • still looking through documentation of D3

Postprocessor:

  • fix postprocessor bugs connected to capital letters and to inconsistency in how scope is defined
    • for Twitter handles, never use capital letters when defining scope
  • dask multithreading: doesn't seem to work properly, not worth fiddling with
  • metascraper: it's all working
  • once the KPP data question is settled, we'll test on the NYT Middle East Archive search; if the postprocessor results aren't substantially different, the new postprocessor will be merged to master
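
The capital-letter rule for Twitter handles above can be sketched as follows. This is only an illustration, not the postprocessor's actual code: the names `normalize_handle` and `in_scope` and the sample handles are hypothetical, and it assumes the fix is to lowercase both the scope entries and the handles found in the data before comparing them.

```python
from typing import Iterable, Set


def normalize_handle(handle: str) -> str:
    """Strip whitespace and a leading '@', then lowercase the handle."""
    return handle.strip().lstrip("@").lower()


def build_scope(handles: Iterable[str]) -> Set[str]:
    """Normalize every scope entry once, so later lookups are case-insensitive."""
    return {normalize_handle(h) for h in handles}


def in_scope(handle: str, scope: Set[str]) -> bool:
    """Check a handle from the crawled data against the normalized scope."""
    return normalize_handle(handle) in scope


# Hypothetical scope entries, written with mixed case on purpose.
scope = build_scope(["@ExampleNews", "some_reporter"])
```

With this, `in_scope("@EXAMPLENEWS", scope)` and `in_scope("examplenews", scope)` give the same answer, so capitalization in the scope file can no longer change the results.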

Crawls:

  • theguardian & small domain crawls now working again
    • theguardian: at about 1,000,000
    • small domain: 1,000,000
  • Alejandro will send twitter handles

Storage of datasets:

  • started moving the datasets
  • also possible to run metascraper on old datasets

Action Items

  • finalize testing of the new postprocessor and merge to master if it works
  • start NYT Politics Archive postprocessing if postprocessor is done
  • continue learning D3 for edge-node visualizations
  • start a new crawl with the Twitter accounts Alejandro will send
  • meet with Alejandro to finalize looking at datasets
  • Twitter: embedded tweet issue
  • to discuss next meeting:
    • how to cut a release
    • writing a paper about MediaCAT and architecture

Backburner

  • Apify pre-navigation: probably need a blacklist for each domain, but could look into it in the future
  • using crawler proxies
  • adding to regular postprocessor output:
    1. any non-scope domain hyperlink that ends in .co.il
    2. any link to a tweet or twitter handle
    • This is a bit outside our normal functionality, so I will put it on the backburner for now.
  • what to do with htz.li
  • finding language function
  • image_reference function
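
For the backburner item on adding links to the regular postprocessor output, a minimal sketch of the two filters might look like the following. Everything here is an assumption for illustration: the function names, the `TWITTER_HOSTS` set, the sample URLs, and the idea that scope domains are stored as bare lowercase hostnames without `www.`. It also assumes the links are absolute URLs with a scheme.

```python
from typing import Dict, Iterable, List, Set
from urllib.parse import urlparse

# Assumed set of hostnames that count as Twitter links.
TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}


def hostname(url: str) -> str:
    """Return the lowercase hostname of an absolute URL, or '' if there is none."""
    return (urlparse(url).hostname or "").lower()


def is_out_of_scope_co_il(url: str, scope_domains: Set[str]) -> bool:
    """True for a hyperlink whose host ends in .co.il and is not a scope domain."""
    host = hostname(url)
    if host.startswith("www."):
        host = host[len("www."):]
    return host.endswith(".co.il") and host not in scope_domains


def is_twitter_link(url: str) -> bool:
    """True for a link to a tweet or a Twitter handle page."""
    return hostname(url) in TWITTER_HOSTS


def extra_links(urls: Iterable[str], scope_domains: Set[str]) -> Dict[str, List[str]]:
    """Group the links that the two backburner rules would add to the output."""
    found: Dict[str, List[str]] = {"co_il": [], "twitter": []}
    for url in urls:
        if is_out_of_scope_co_il(url, scope_domains):
            found["co_il"].append(url)
        elif is_twitter_link(url):
            found["twitter"].append(url)
    return found


# Hypothetical links found in one article.
links = [
    "https://www.example.co.il/story",
    "https://inscope.co.il/article",
    "https://twitter.com/someuser/status/1",
    "https://example.com/",
]
result = extra_links(links, scope_domains={"inscope.co.il"})
```

Here `result["co_il"]` holds only the `example.co.il` link, because `inscope.co.il` is in scope, and `result["twitter"]` holds the tweet link. Shortened links such as htz.li would need to be resolved before a check like this could classify them.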