Dec 14, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Servers

folders on servers: for each one put on crawl index
- many results probably are duplicates -- need protocol to think about combination
Raazia doesn't see any changes to Arbutus

error with Electronic Intifada happening sometimes
recommendation from IA developer is to move to CDX API instead of Availability API
- CDX seems to include all results in one call
- this could solve the issue with Availability API that would only give one URL at a time
CDX order: request all results, then construct the IA URL for each result, then download the HTML
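
A minimal sketch of the CDX order described above (query once, build the IA URL for each row, then download). The query parameters chosen here (`matchType=prefix`, `filter=statuscode:200`, `collapse=urlkey`) are assumptions for illustration, not necessarily the features the IA developer mentioned, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, limit=None):
    """Build one CDX query URL that asks for all captures under `site`."""
    params = {
        "url": site,
        "matchType": "prefix",         # assumption: everything under this URL prefix
        "output": "json",              # first row of the response is the field header
        "fl": "timestamp,original",    # only the fields needed to build the IA URL
        "filter": "statuscode:200",    # assumption: keep successful captures only
        "collapse": "urlkey",          # assumption: one capture per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_snapshot_urls(rows):
    """Turn CDX JSON rows (header row first) into Wayback snapshot URLs."""
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    return [f"https://web.archive.org/web/{r[ts_i]}/{r[orig_i]}" for r in records]


def fetch_snapshot_urls(site, limit=None):
    """Run the query over the network and return the constructed IA URLs."""
    with urlopen(build_cdx_query(site, limit)) as resp:
        body = resp.read().decode("utf-8")
    rows = json.loads(body) if body.strip() else []
    return rows_to_snapshot_urls(rows)
```

Each returned URL can then be downloaded as ordinary HTML. Starting with a small `limit` is a reasonable way to check the output before running a full crawl.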

if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
- start with a small batch of each one - 50-100?
- move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
- document protocol used to combine results
contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
figure out CDX API with some of the features mentioned by IA developer - Ra
once using new API - set up electronicintifada & Jadaliyya crawls on IA - Ra
figure out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
send email about URLs and new URLs and anything else about estimating actual articles - Ra
follow up with Nat to find time for next week if possible - Fr
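
For the combining task in the action items above, one possible protocol is sketched below. It assumes each crawl result is a JSON file with a `url` field (the real result format may differ), treats two results as duplicates when their normalized URLs match, and copies the first occurrence into a new folder so the original crawl folders stay untouched. All function names are hypothetical.

```python
import json
import shutil
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def combine_crawl_folders(source_dirs, dest_dir, batch_size=None):
    """Copy one result per distinct URL from source_dirs into dest_dir.

    Returns (kept, duplicates): kept maps normalized URL -> source file,
    duplicates lists (skipped file, file already kept for that URL).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    kept = {}
    duplicates = []
    for src in source_dirs:
        files = sorted(Path(src).glob("*.json"))
        if batch_size is not None:
            files = files[:batch_size]  # small batch per folder, e.g. 50-100
        for f in files:
            with f.open(encoding="utf-8") as fh:
                record = json.load(fh)
            key = normalize_url(record["url"])  # assumption: results carry a "url" field
            if key in kept:
                duplicates.append((str(f), kept[key]))
                continue
            kept[key] = str(f)
            # Prefix with a counter so same-named files from different folders do not collide.
            shutil.copy2(f, dest / f"{len(kept):06d}_{f.name}")
    return kept, duplicates
```

The returned `duplicates` list could be saved alongside the combined folder as part of documenting the protocol, since it records which file was dropped in favour of which.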