Jan 25, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to a new folder so that the original crawl results are preserved until we have confirmed the combined set is correct
    • document protocol used to combine results
  • contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
  • try to restart IA crawl and if it doesn't restart, send email inquiry about being blocked - Ra
  • figure out how to estimate the actual number of articles for NYT, and document the protocol for estimating actual articles on IA - Ra
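
The combining step in the first agenda item (small batch, combined results in a new folder, originals preserved, protocol documented) could be sketched roughly as below. This is only an illustration under assumptions that are not in these notes: that each crawl directory holds one JSON file per article, that each file has a `url` field, and that duplicates can be identified by a normalized URL. The names `normalize_url`, `combine_crawls` and `combine_log.json` are hypothetical, not part of the MediaCAT code.

```python
import json
import shutil
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def combine_crawls(crawl_dirs, combined_dir, batch_size=None):
    """Copy unique articles into combined_dir; original files are never modified.

    ASSUMPTION: one JSON file per article, each with a "url" key.
    batch_size limits how many files are read per crawl directory
    (e.g. 50-100 for a first trial run).
    """
    combined = Path(combined_dir)
    combined.mkdir(parents=True, exist_ok=True)
    seen = {}  # normalized URL -> path of the file that was kept
    log = {"kept": [], "duplicates": [], "skipped": []}

    for crawl_dir in crawl_dirs:
        files = sorted(Path(crawl_dir).glob("*.json"))
        if batch_size is not None:
            files = files[:batch_size]
        for src in files:
            try:
                record = json.loads(src.read_text(encoding="utf-8"))
                key = normalize_url(record["url"])
            except (ValueError, KeyError, TypeError, AttributeError):
                # Unreadable file or missing/invalid "url": record it and move on.
                log["skipped"].append(str(src))
                continue
            if key in seen:
                log["duplicates"].append(
                    {"file": str(src), "duplicate_of": seen[key]}
                )
                continue
            seen[key] = str(src)
            # Prefix with a counter so same-named files from two crawls don't collide.
            dest = combined / f"{len(seen):06d}_{src.name}"
            shutil.copy2(src, dest)
            log["kept"].append(str(src))

    # The log doubles as a record of what the combining run actually did.
    log_path = combined / "combine_log.json"
    log_path.write_text(json.dumps(log, indent=2), encoding="utf-8")
    return log
```

A trial run on a small batch might look like `combine_crawls(["graham_crawl", "arbutus_crawl"], "combined_trial", batch_size=100)` (directory names are placeholders). Because files are copied rather than moved, the original crawl folders stay intact until the combined set has been checked.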

Crawls

  • starting to combine crawl results; should be done by reading week
  • new problem on Arbutus - it doesn't always appear, and it disappears after a server-wide update (Arbutus) - not focusing on fixing it right now

IA Crawl

  • Electronic Intifada crawl restarted; almost done
  • NYT estimating: 1000 w/ Mid E -- check with world

Post-processing:

  • figured out the postprocessing problem, but it is taking some time to process a dataset of about 10 million tweets
  • converted IA dataset (30-40,000) - 230 output

unit-testing:

  • this week

Action Items

  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
  • continue combining results and document the protocol - Gy
  • if time, contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
  • unit-testing each function of the postprocessor for IA dataset - Ar
  • if unit-testing shows the postprocessor is accurate, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar
  • continue with postprocessing of WaPo - Fr
  • continue to work on figuring out IA NYT estimating - Ra