Dec 14, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Servers

  • folders on servers: add each one to the crawl index
    • many results are probably duplicates -- need a protocol for combining them
  • Raazia doesn't see any changes to Arbutus

IA crawling

  • error with Electronic Intifada happening sometimes
  • recommendation from IA developer is to move to CDX API instead of Availability API
    • CDX seems to include all results in one call
    • this could solve the issue of the Availability API returning only one URL at a time
  • CDX order of operations: request all results - then construct the IA URL for each result - then download the HTML
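
A minimal sketch of the CDX order of operations noted above (all results, then IA URL, then HTML download), not the project's actual crawler code. Assumptions: the public Wayback CDX endpoint and its `url`, `matchType`, `output`, `fl`, `filter`, `collapse` and `limit` query parameters; the function names are hypothetical, and the features the IA developer mentioned are not known here, so this parameter set is only a guess at a starting point.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public Wayback Machine CDX endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, limit=None):
    """Build one CDX query URL asking for all captures under a domain."""
    params = {
        "url": site,
        "matchType": "domain",           # include subdomains and all paths
        "output": "json",                # JSON array; first row is the header
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",      # keep only successful captures
        "collapse": "urlkey",            # one capture per distinct URL
    }
    if limit is not None:
        params["limit"] = str(limit)
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON rows (header row first) into a list of dicts."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]


def snapshot_url(capture):
    """Construct the IA URL for one capture from its timestamp and original URL."""
    return "https://web.archive.org/web/{timestamp}/{original}".format(**capture)


def fetch_json(url):
    """Download and decode a JSON response (network call)."""
    with urlopen(url, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def download_html(url):
    """Download the archived page as text (network call)."""
    with urlopen(url, timeout=60) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

In use, the three steps would chain as `captures = parse_cdx_rows(fetch_json(build_cdx_query(domain, limit=50)))`, then `download_html(snapshot_url(c))` for each capture, ideally with a delay between requests.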

Postprocessing

  • need more time

Action Items

  • if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
    • document protocol used to combine results
  • contact Nat about the new kind of error on Arbutus cloud during the nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
  • figure out CDX API with some of the features mentioned by IA developer - Ra
  • once using new API - set up electronicintifada & Jadaliyya crawls on IA - Ra
  • figure out how to estimate the actual number of articles for NYT and document the protocol for estimating actual articles on IA - Ra
  • send email about URLs and new URLs and anything else about estimating actual articles - Ra
  • follow up with Nat to find a time next week if possible - Fr
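
One possible starting point for the combine-and-deduplicate action item, as a sketch only. Assumptions: each crawl folder holds CSV files with a `url` column (the real output format may differ), duplicates are defined as rows whose URLs match after light normalization, and all function names plus the `combined.csv` file name are hypothetical. Originals are only read; the combined file is written to a separate folder.

```python
import csv
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def combine_rows(batches, key="url"):
    """Merge batches of rows, keeping the first row seen for each URL."""
    seen = set()
    combined = []
    for rows in batches:
        for row in rows:
            norm = normalize_url(row[key])
            if norm not in seen:
                seen.add(norm)
                combined.append(row)
    return combined


def combine_folders(source_dirs, out_dir, key="url", sample=None):
    """Read CSVs from each source folder and write one combined CSV to out_dir.

    `sample` caps the rows taken per folder (e.g. 50-100 for a first trial).
    Source folders are never modified.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    batches = []
    for folder in source_dirs:
        rows = []
        for path in sorted(Path(folder).glob("*.csv")):
            with path.open(newline="", encoding="utf-8") as fh:
                rows.extend(csv.DictReader(fh))
        batches.append(rows[:sample] if sample else rows)
    combined = combine_rows(batches, key)
    if combined:
        with (out / "combined.csv").open("w", newline="", encoding="utf-8") as fh:
            # Columns come from the first kept row; extra columns are skipped.
            writer = csv.DictWriter(fh, fieldnames=list(combined[0].keys()),
                                    extrasaction="ignore")
            writer.writeheader()
            writer.writerows(combined)
    return len(combined)
```

Keeping the first occurrence is an arbitrary choice; the documented protocol may prefer the most recent or most complete row instead.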