Jan 11, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
  - start with a small batch of each one - 50-100?
  - move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
  - document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
- figure out the CDX API, using some of the features mentioned by the IA developer - Ra
- once using the new API, set up electronicintifada & Jadaliyya crawls on IA - Ra
- figure out how to estimate the actual number of articles for NYT, and document the protocol for estimating actual articles on IA - Ra
- send email about URLs and new URLs and anything else about estimating actual articles - Ra
- follow up with Nat to find a time for next week if possible - Fr
Crawls
IA Crawl
- changed the code for the CDX API, seems to be working better
- filtering before might have been too aggressive
- set up crawl for electronicintifada a week ago
- problem: crawl speed; with more articles than before and the sleeps kept in place, the higher volume means crawls take longer
- added script to send email once the crawl is finished
- can't really see how many downloads have been completed in the middle of a crawl
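For reference, a minimal sketch of what a filtered CDX query could look like. The notes do not say which CDX features the IA developer suggested, so the `filter` and `collapse` choices below are assumptions, as are the helper names `build_cdx_query` and `parse_cdx_rows`; the sketch only builds the request URL and parses a JSON response, and does not reflect the actual crawler code.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, start, end, limit=None):
    """Build a CDX query URL for one domain (hypothetical helper).

    Assumed filtering: keep only HTTP 200 HTML captures and collapse
    repeated captures of the same URL. Loosen these if the filtering
    turns out to be too aggressive.
    """
    params = [
        ("url", domain),
        ("matchType", "domain"),          # include subdomains
        ("from", start),                  # e.g. "2023" or "20230101"
        ("to", end),
        ("output", "json"),
        ("fl", "timestamp,original,statuscode,digest"),
        ("filter", "statuscode:200"),
        ("filter", "mimetype:text/html"),
        ("collapse", "urlkey"),           # one row per distinct URL
    ]
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON output (first row is the header) into dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

Counting the parsed rows for a date range gives a rough upper bound on distinct captured URLs, which may help with the article estimates, though not every captured URL is an article.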
Postprocessor
- managed to figure out the dataset issue from before, but now there's a new error in the processing
- 3 sections: (1) input: fine, (2) processor: has the issue, (3) output
Action Items
- send Alejandro an estimate of crawl speed once the electronicintifada IA crawl is done - Ra
- figure out how to estimate the actual number of articles for NYT, and document the protocol for estimating actual articles on IA - Ra
- send email about URLs and new URLs and anything else about estimating actual articles - Ra
- look into the processor issue - Fr
- look at developing unit testing for foxnews twitter postprocessed results, for example, on text alias & hyperlinks - Fr
- if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
  - start with a small batch of each one - 50-100?
  - move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
  - document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
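One possible starting point for the combine-and-deduplicate item, as a sketch only. It assumes each crawl result is a JSON file with a `url` field and that duplicates can be detected by a normalized URL; the real result format and the right duplicate criterion still need to be confirmed. The names `normalize_url` and `combine_crawls` are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key (assumed duplicate criterion).

    Ignores scheme, a leading "www.", the fragment, and a trailing slash.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    key = host + parts.path.rstrip("/")
    if parts.query:
        key += "?" + parts.query
    return key


def combine_crawls(source_dirs, dest_dir, batch_size=None):
    """Copy unique records from several crawl folders into dest_dir.

    Source folders are only read, never modified, so the original
    crawl results are preserved. batch_size limits how many files are
    taken from each source, for trying a small batch first.
    Returns (records kept, duplicates skipped).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    seen = set()
    kept = duplicates = 0
    for src in source_dirs:
        files = sorted(Path(src).glob("*.json"))
        if batch_size is not None:
            files = files[:batch_size]
        for path in files:
            record = json.loads(path.read_text(encoding="utf-8"))
            key = normalize_url(record["url"])
            if key in seen:
                duplicates += 1
                continue
            seen.add(key)
            out = dest / f"{kept:06d}.json"
            out.write_text(json.dumps(record), encoding="utf-8")
            kept += 1
    return kept, duplicates
```

Calling `combine_crawls([dir_a, dir_b], "combined", batch_size=50)` would try a 50-file batch from each source. When the same URL appears in more than one crawl, the first copy encountered wins; whether that is the right choice is part of the protocol to document.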