Jan 11, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
- start with a small batch of each one - 50-100?
- move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
- document protocol used to combine results
contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
figure out how to use the CDX API with some of the features mentioned by the IA developer - Ra
once using new API - set up electronicintifada & Jadaliyya crawls on IA - Ra
figure out how to estimate the actual number of articles for NYT, and document the protocol for estimating actual articles on IA - Ra
send email about URLs and new URLs and anything else about estimating actual articles - Ra
follow up with Nat to find a time next week, if possible - Fr
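
A possible starting point for the combine-and-deduplicate item above, as a minimal sketch rather than the agreed protocol. It assumes each crawl result is a dict with a `url` field and that two results are duplicates when their normalized URLs match; the field name, the normalization rules, and the `combined.json` file name are all assumptions for illustration. Writing to a separate folder leaves the original crawl results untouched.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def combine_records(batches, key="url"):
    """Merge several batches of crawl records, keeping the first record seen per URL."""
    seen = set()
    combined = []
    for batch in batches:
        for record in batch:
            norm = normalize_url(record[key])  # assumed duplicate criterion
            if norm in seen:
                continue
            seen.add(norm)
            combined.append(record)
    return combined


def write_combined(records, out_dir):
    """Write the combined records to a new folder; the source files are not modified."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    target = out_path / "combined.json"  # hypothetical file name
    target.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return target
```

Running this on a small batch first (50-100 results from each crawl) makes it easy to check by hand which records were dropped before applying it to the full results.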

changed the code for the CDX API, seems to be working better
- filtering before might have been too aggressive
set up crawl for electronicintifada a week ago
- problem: crawl speed. There are more articles than before, so with the higher volume and the sleeps kept in place, the crawl is taking longer
- added script to send email once the crawl is finished
- can't really see how many downloads have been completed in the middle of a crawl
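
For reference, a CDX query with lighter filtering might be built as in the sketch below. This is not the code that was changed. The endpoint and the parameter names (`matchType`, `filter`, `collapse`, `fl`, `from`, `to`, `limit`) are documented Wayback CDX server options, but the particular values chosen here are assumptions, and they are not necessarily the features the IA developer mentioned. The function only builds the URL, so it can be inspected before any request is sent.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, from_year=None, to_year=None, limit=None):
    """Build a CDX search URL for every capture under a domain."""
    params = [
        ("url", domain),
        ("matchType", "domain"),         # include subdomains
        ("output", "json"),
        ("fl", "timestamp,original,statuscode,mimetype"),
        ("filter", "statuscode:200"),    # assumed: keep only successful captures
        ("collapse", "urlkey"),          # assumed: one row per distinct URL
    ]
    if from_year is not None:
        params.append(("from", str(from_year)))
    if to_year is not None:
        params.append(("to", str(to_year)))
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # example.org is a placeholder domain
    print(build_cdx_query("example.org", from_year=2010, limit=100))
```

Because `collapse=urlkey` returns one row per distinct URL, the row count from a query like this could also serve as a rough upper bound when estimating actual articles, though non-article pages would still be included.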

managed to figure out the dataset issue from before, but now there's a new error in the processing
3 sections: (1) input - fine, (2) processor - issue, (3) output

send Alejandro an estimate of speed once the electronicintifada IA crawl is done - Ra
figure out how to estimate the actual number of articles for NYT, and document the protocol for estimating actual articles on IA - Ra
send email about URLs and new URLs and anything else about estimating actual articles - Ra
Processor issue - Fr
look at developing unit testing for foxnews twitter postprocessed results, for example, on text alias & hyperlinks - Fr
if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
- start with a small batch of each one - 50-100?
- move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
- document protocol used to combine results
contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
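
For the unit-testing item above (foxnews Twitter postprocessed results), one way to start is a small checker with tests around it. This is a sketch only: the record schema (`text`, `text_aliases`, `hyperlinks`) and the two rules it checks are hypothetical and would need to be replaced with the actual postprocessor output format.

```python
import unittest
from urllib.parse import urlsplit


def check_record(record):
    """Return a list of problems found in one postprocessed record (hypothetical schema)."""
    problems = []
    text = record.get("text", "").lower()
    # Assumed rule: every reported alias should actually occur in the text.
    for alias in record.get("text_aliases", []):
        if alias.lower() not in text:
            problems.append(f"alias not in text: {alias}")
    # Assumed rule: every hyperlink should be an absolute http(s) URL.
    for link in record.get("hyperlinks", []):
        parts = urlsplit(link)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append(f"malformed hyperlink: {link}")
    return problems


class CheckRecordTest(unittest.TestCase):
    def test_clean_record_has_no_problems(self):
        record = {
            "text": "Report cites the New York Times",
            "text_aliases": ["new york times"],
            "hyperlinks": ["https://www.nytimes.com/a"],
        }
        self.assertEqual(check_record(record), [])

    def test_missing_alias_and_bad_link_are_reported(self):
        record = {
            "text": "No outlets mentioned here",
            "text_aliases": ["new york times"],
            "hyperlinks": ["www.nytimes.com/a"],
        }
        self.assertEqual(len(check_record(record)), 2)


if __name__ == "__main__":
    unittest.main()
```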