Jan 18, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- send Alejandro an estimate of speed once the electronicintifada IA crawl is done - Ra
- figure out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
- send email about URLs and new URLs and anything else about estimating actual articles - Ra
- Processor issue - Fr
- look at developing unit testing for foxnews twitter postprocessed results, for example, on text alias & hyperlinks - Fr
- if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
  - start with a small batch of each one - 50-100?
  - move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
  - document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
IA crawl
- issue on speed: the crawl didn't progress after the last meeting; it just stopped
  - no error message
  - no blocked message
  - seemed to crawl a comment URL before stopping; the comment URL seemed to be captured
Postprocessing:
- lots of logging to try to get at the processing issue; each round takes 90 minutes
- meeting Francisco & Aryan to look at unit testing
Action Items
- if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
  - start with a small batch of each one - 50-100?
  - move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
  - document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
- try to restart IA crawl and if it doesn't restart, send email inquiry about being blocked - Ra
- figure out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
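
The combine-and-deduplicate action item above could start from something like the sketch below. It is only an illustration, not the project's protocol: it assumes each crawl folder holds one JSON file per article with a `url` field, and the names `normalize_url`, `combine_results`, `load_batch`, and `write_combined` are hypothetical. It reads a small batch from each folder, keeps the first record seen for each URL, and writes the result to a separate folder so the original crawl results stay untouched.

```python
import json
from pathlib import Path


def normalize_url(url):
    """Hypothetical normalization: trim whitespace and a trailing slash."""
    return url.strip().rstrip("/")


def combine_results(batches, key="url"):
    """Merge lists of records, keeping the first record seen for each URL."""
    seen = set()
    combined = []
    for batch in batches:
        for record in batch:
            norm = normalize_url(record[key])
            if norm in seen:
                continue  # duplicate of an earlier record, skip it
            seen.add(norm)
            combined.append(record)
    return combined


def load_batch(folder, limit=100):
    """Read up to `limit` JSON records from a crawl folder (read-only)."""
    paths = sorted(Path(folder).glob("*.json"))[:limit]
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]


def write_combined(records, out_folder):
    """Write combined records to a new folder; originals are never modified."""
    out = Path(out_folder)
    out.mkdir(parents=True, exist_ok=True)
    target = out / "combined.json"
    target.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return target
```

A first trial might call `load_batch` on each crawl folder with `limit=50`, pass the batches to `combine_results`, and compare the record counts before and after to see how many duplicates were dropped. What counts as a duplicate (exact URL, normalized URL, or matching article text) is a choice the documented protocol would need to settle.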