Jan 25, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

if there is time over the break, begin combining crawl results and eliminating duplicates - Gy
- start with a small batch of each one - 50-100?
- move combined results to a new folder so that the original crawl results are preserved until we are sure the combined set is correct
- document protocol used to combine results
contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
try to restart IA crawl and if it doesn't restart, send email inquiry about being blocked - Ra
figure out how to estimate the actual number of articles for NYT and document the protocol for estimating actual articles on IA - Ra
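
For the combining task above, a minimal sketch of one possible protocol, not the project's actual method. It assumes each crawl result is a single JSON file with a `url` field that identifies the article; the function name `combine_crawl_results` and the `key` and `batch_size` parameters are hypothetical. Source folders are only read, so the original crawl results stay untouched, and `batch_size` limits the run to a small batch per folder (for example 50-100 files).

```python
import json
from pathlib import Path


def combine_crawl_results(source_dirs, combined_dir, key="url", batch_size=None):
    """Copy the first record seen for each key into combined_dir.

    Assumption: one JSON object per file, with `key` identifying the article.
    Files in source_dirs are never modified or deleted.
    """
    combined_dir = Path(combined_dir)
    combined_dir.mkdir(parents=True, exist_ok=True)

    seen = set()
    counts = {"kept": 0, "duplicates": 0, "missing_key": 0}

    for source in source_dirs:
        files = sorted(Path(source).glob("*.json"))
        if batch_size is not None:
            files = files[:batch_size]  # small trial batch per crawl folder

        for path in files:
            with path.open(encoding="utf-8") as fh:
                record = json.load(fh)

            value = record.get(key)
            if value is None:
                counts["missing_key"] += 1  # cannot compare, so leave it out
                continue
            if value in seen:
                counts["duplicates"] += 1
                continue

            seen.add(value)
            counts["kept"] += 1
            # Sequential names avoid collisions between folders.
            out_path = combined_dir / f"{counts['kept']:06d}.json"
            with out_path.open("w", encoding="utf-8") as fh:
                json.dump(record, fh, ensure_ascii=False)

    return counts
```

The returned counts (kept, duplicates, missing key) can be copied into the protocol documentation for each batch. Exact URL matching is the simplest choice; if the same article appears under different URLs, the key would need normalizing first.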

starting to combine, should be done by reading week
new problem on Arbutus: it appears only intermittently and disappears after a server-wide update on Arbutus, so not focusing on it right now

figured out the postprocessing problem, but processing a dataset of about 10 million tweets is taking some time
converted IA dataset (30-40,000) - 230 output

attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
continue combining results and document - Gy
if time, contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
unit-testing each function of the postprocessor for IA dataset - Ar
if unit-testing shows accuracy, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar
continue with postprocessing of WaPo - Fr
continue to work on figuring out IA NYT estimating - Ra