Jan 11, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • if there is time over the break, begin looking into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to a new folder so that the original crawl results are preserved until we are sure the combined set is the right one
    • document protocol used to combine results
  • contact Nat about the new kind of error on Arbutus cloud with the nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
  • figure out the CDX API, using some of the features mentioned by the IA developer - Ra
  • once the new API is in use, set up electronicintifada & Jadaliyya crawls on IA - Ra
  • work out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
  • send email about URLs and new URLs and anything else about estimating actual articles - Ra
  • follow up with Nat to find a time next week, if possible - Fr

Crawls

IA Crawl

  • changed the code for the CDX API, seems to be working better
    • the earlier filtering might have been too aggressive
  • set up crawl for electronicintifada a week ago
    • problem: crawl speed. There are more articles than before, so the higher volume is taking longer, and the sleeps are being kept in
    • added script to send email once the crawl is finished
    • can't really see how many downloads have been completed in the middle of a crawl
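
As a rough illustration of the CDX API change noted above, the Python sketch below builds a Wayback Machine CDX query with optional filters and turns the JSON response rows into records. It is not the project's crawler code: the function names, the example filter and the example domain are assumptions chosen for illustration. The endpoint and the url, matchType, output, filter, collapse, from, to and limit parameters come from the public CDX server API.

```python
from urllib.parse import urlencode

# Public Wayback Machine CDX server endpoint.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain, filters=(), collapse=None,
                    from_ts=None, to_ts=None, limit=None):
    """Build a CDX query URL for every capture under a domain.

    Hypothetical helper: `filters` is a list of CDX filter expressions
    such as "statuscode:200". Passing fewer filters makes the query
    less aggressive about discarding captures.
    """
    params = [("url", domain), ("matchType", "domain"), ("output", "json")]
    for expression in filters:
        # The CDX API accepts the filter parameter more than once.
        params.append(("filter", expression))
    if collapse:
        params.append(("collapse", collapse))
    if from_ts:
        params.append(("from", from_ts))
    if to_ts:
        params.append(("to", to_ts))
    if limit is not None:
        params.append(("limit", str(limit)))
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Convert CDX JSON output into a list of dicts.

    With output=json the first row holds the field names and each
    following row is one capture.
    """
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]
```

For example, build_cdx_query("electronicintifada.net", filters=["statuscode:200"], collapse="urlkey") keeps one successful capture per URL key. Leaving out the filters argument returns every capture, which can help check whether a filter is discarding too much.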

Postprocessor

  • managed to figure out the dataset issue from before, but now there's a new error in the processing
  • 3 sections: (1) input - fine, (2) processor - issue, (3) output

Action Items

  • notify Alejandro once the electronicintifada IA crawl is done, with an estimate of speed - Ra
  • work out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
  • send email about URLs and new URLs and anything else about estimating actual articles - Ra
  • Processor issue - Fr
  • look at developing unit tests for the foxnews twitter postprocessed results, for example on text aliases & hyperlinks - Fr
  • if there is time over the break, begin looking into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to a new folder so that the original crawl results are preserved until we are sure the combined set is the right one
    • document protocol used to combine results
  • contact Nat about the new kind of error on Arbutus cloud with the nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
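
One possible starting point for the combine-and-deduplicate item is sketched below in Python. It assumes each crawl result is a dict with a "url" field and that two results are duplicates when their URLs match after light normalization. The record shape, the normalization rules, the function names and the output file name are all assumptions, not the agreed protocol. The combined set is written to a separate folder, so the original crawl results are never modified.

```python
import json
from pathlib import Path
from urllib.parse import urlsplit


def url_key(url):
    """Return a comparison key for a URL (assumed rules, adjust as needed).

    Ignores the scheme, a leading "www.", letter case in the host,
    a trailing slash and any fragment.
    """
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.rstrip("/") or "/"
    return (host, path, parts.query)


def combine_results(*batches):
    """Merge batches of records, keeping the first record seen per URL key.

    Returns the combined list and the number of duplicates dropped.
    """
    seen = set()
    combined = []
    dropped = 0
    for batch in batches:
        for record in batch:
            key = url_key(record["url"])
            if key in seen:
                dropped += 1
                continue
            seen.add(key)
            combined.append(record)
    return combined, dropped


def write_combined(records, out_dir):
    """Write the combined records to a new folder, leaving inputs untouched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "combined.json"  # hypothetical file name
    out_path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return out_path
```

Running this on a batch of 50-100 records per crawl and checking the dropped count by hand would be one way to validate the normalization rules before applying them to the full results.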