Oct 12, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Crawl/Servers

  • Graham cloud is close to working state
  • Arbutus storage already allocated

Postprocessor

  • met with Nat - fixed on small data set

Action Items

  • try to set up a small instance on Arbutus to test hypothesis about getting new IP - Gy
  • check small instance that's running with local storage, email Alejandro if anything valuable - Gy
  • if Graham comes back online and there is time, troubleshoot the setup - Gy
  • look up IA guidelines on pausing between calls, or write to Nat, to see whether it is possible to crawl faster - Ra
  • check whether metascraper can work directly on the HTML instead of making an extra call - Ra
  • send results to Aryan and Francisco to test postprocessor - Ra
  • look at adding article length - Ra
  • check if postprocessor applies tags to citations and citing articles - Fr
  • delete duplicates in Twitter output before running in postprocessor - Fr
  • remove vulnerable files/libraries from archived postprocessor - Fr
  • run postprocessor on: - Fr/Ar
    • (1) dataset from Raazia
    • (2) Fox News Twitter data
  • monitor postprocessor and start trouble-shooting - Ar
  • check how difficult it would be for the postprocessor to accept article length - Ar
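
For the duplicate-removal item above (Fr), one possible approach is sketched below. It is only a minimal illustration, assuming the Twitter output can be read as CSV rows and that a single column identifies each tweet. The column name `id`, the helper `drop_duplicate_rows`, and the sample data are all hypothetical; the real output format may differ.

```python
import csv
import io


def drop_duplicate_rows(rows, key="id"):
    """Yield rows in their original order, skipping repeated `key` values."""
    seen = set()
    for row in rows:
        value = row[key]
        if value in seen:
            continue  # a row with this key was already emitted
        seen.add(value)
        yield row


# Tiny in-memory sample; the column names are made up for illustration.
sample = io.StringIO("id,text\n1,first\n2,second\n1,first\n")
unique_rows = list(drop_duplicate_rows(csv.DictReader(sample)))
print([row["id"] for row in unique_rows])
```

Keeping the first occurrence and preserving order is a choice made here for simplicity; if duplicates can differ in other columns, the rule for which copy to keep would need to be decided first.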