Jan 25, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
- start with a small batch of each one - 50-100?
- move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
- document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
- try to restart IA crawl and if it doesn't restart, send email inquiry about being blocked - Ra
- figure out how to estimate the actual number of articles for NYT and document the protocol for estimating actual articles on IA - Ra
Crawls
- starting to combine, should be done by reading week
- new problem on Arbutus still to fix - it is intermittent and disappears after a server-wide update (Arbutus) - not focusing on it right now
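A minimal sketch of how the combining step from the agenda could work is below. It uses a small batch from each crawl, writes combined results to a new folder, and leaves the originals untouched. It is not the protocol actually used. It assumes each crawl result is a JSON file holding one object with a `url` field. The function name `combine_crawl_results`, the `key` and `limit` parameters, and the folder layout are all hypothetical.

```python
import json
import shutil
from pathlib import Path


def combine_crawl_results(source_dirs, dest_dir, key="url", limit=None):
    """Copy one file per unique `key` value from source_dirs into dest_dir.

    Assumption: every *.json file holds a single JSON object with a `key` field.
    Original files are only read, never moved or deleted.
    Returns (copied, skipped) lists of source paths.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    seen = set()
    copied, skipped = [], []
    for src_index, src in enumerate(source_dirs):
        taken = 0
        for path in sorted(Path(src).glob("*.json")):
            # Optional small batch per crawl (e.g. 50-100 files).
            if limit is not None and taken >= limit:
                break
            taken += 1
            with path.open(encoding="utf-8") as f:
                record = json.load(f)
            value = record.get(key)
            # Skip duplicates and records that lack the key.
            if value is None or value in seen:
                skipped.append(path)
                continue
            seen.add(value)
            # Prefix with the source index so equal filenames do not collide.
            shutil.copy2(path, dest / f"{src_index}_{path.name}")
            copied.append(path)
    return copied, skipped
```

Returning the skipped paths gives a record of what was dropped as a duplicate, which could feed into documenting the protocol. A call such as `combine_crawl_results(["crawl_graham", "crawl_arbutus"], "combined", limit=100)` would process at most 100 files per crawl (the folder names here are placeholders).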
IA Crawl
- Electronic Intifada crawl restarted; almost done
- NYT estimating: 1000 w/ Mid E -- check with world
Post-processing:
- figured out the postprocessing problem, but processing a dataset of about 10 million tweets is taking some time
- converted IA dataset (30-40,000) - 230 output
unit-testing:
Action Items
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
- continue combining results and document the protocol - Gy
- if time, contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- unit-testing each function of the postprocessor for IA dataset - Ar
- if unit-testing shows accuracy, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar
- continue with postprocessing of WaPo - Fr
- continue to work on figuring out IA NYT estimating - Ra