Dec 14, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Servers
- folders on servers: add each one to the crawl index
- many results are probably duplicates -- need a protocol for combining them
- Raazia doesn't see any changes to Arbutus
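
One possible starting point for the combination protocol is sketched below. It is not an agreed method. It assumes each crawl result is a dict with a "url" field (a hypothetical layout, not the actual crawl output format) and treats two results as duplicates when their URLs match after ignoring the scheme, a leading "www.", and a trailing slash. The names url_key and combine_crawls are illustrative.

```python
from urllib.parse import urlsplit


def url_key(url):
    """Reduce a URL to a comparison key (assumed duplicate rule)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Treat "/a/b" and "/a/b/" as the same page.
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)


def combine_crawls(*crawls):
    """Merge several crawl result lists, keeping the first record per URL key.

    The input lists are not modified, so the original crawl results stay
    intact while the combined list is checked.
    """
    seen = {}
    for crawl in crawls:
        for record in crawl:
            seen.setdefault(url_key(record["url"]), record)
    return list(seen.values())
```

Keeping the first record seen means the order in which crawls are passed decides which copy survives; whichever rule is chosen should be written into the documented protocol.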
IA crawling
- error with Electronic Intifada happens intermittently
- recommendation from IA developer is to move to CDX API instead of Availability API
- CDX seems to include all results in one call
- this could solve the issue with the Availability API, which only gives one URL at a time
- CDX workflow: request all results, then construct the IA snapshot URL for each one, then download the HTML
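
The CDX steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not the crawler's actual code: the function names are made up for this sketch, and the query parameters chosen here (matchType=domain, filter=statuscode:200, collapse=urlkey) are assumptions about what the crawl would want, not the features the IA developer recommended.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, limit=None):
    """Build a CDX API URL asking for captures under a domain as JSON."""
    params = {
        "url": domain,
        "matchType": "domain",       # include subdomains (assumed choice)
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors (assumed choice)
        "collapse": "urlkey",        # one capture per unique URL (assumed choice)
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn CDX JSON output (first row is the header) into a list of dicts."""
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, row)) for row in captures]


def snapshot_url(timestamp, original, raw=False):
    """Construct the Wayback Machine URL for one capture.

    With raw=True the "id_" modifier asks for the page as archived,
    without the Wayback toolbar and link rewriting.
    """
    modifier = "id_" if raw else ""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{original}"


def fetch_captures(domain, limit=None):
    """Single CDX call returning the captures for a domain (needs network)."""
    with urlopen(build_cdx_query(domain, limit), timeout=60) as resp:
        return parse_cdx_rows(json.load(resp))
```

A crawl would then loop over fetch_captures("electronicintifada.net"), build snapshot_url(c["timestamp"], c["original"]) for each capture, and download the HTML from that URL.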
Postprocessing
Action Items
- if there is time over the break, begin looking into combining crawl results and eliminating duplicates - Gy
- start with a small batch from each crawl (50-100?)
- move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
- document protocol used to combine results
- contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
- attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
- figure out CDX API with some of the features mentioned by IA developer - Ra
- once using new API - set up electronicintifada & Jadaliyya crawls on IA - Ra
- figure out how to estimate actual articles for NYT, and document the protocol for estimating actual articles on IA - Ra
- send email about URLs and new URLs and anything else about estimating actual articles - Ra
- follow up with Nat to find a time next week if possible - Fr