Jan 18, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • email Alejandro once the electronicintifada IA crawl is done, including an estimate of crawl speed - Ra
  • figure out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
  • send email about URLs, new URLs, and anything else about estimating actual articles - Ra
  • Processor issue - Fr
  • look at developing unit tests for the foxnews Twitter postprocessed results, for example on text aliases & hyperlinks - Fr
  • if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
    • document protocol used to combine results
  • contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy

IA crawl

  • issue on speed: the crawl didn't progress after the last meeting; it just stopped
    • no error message
    • no blocked message
    • seemed to do a comment URL before stopping; the comment URL seemed to be captured
  • Postprocessing:
    • lots of logging to try to get at the processing issue; each round takes 90 minutes
    • meeting Francisco & Aryan to look at unit testing
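
As a starting point for the unit-testing discussion, here is a minimal sketch of what tests on text aliases and hyperlinks could look like. The helpers `find_text_aliases` and `extract_hyperlinks` are hypothetical stand-ins written for this example, not the postprocessor's actual functions, and the sample tweet text is made up; the real tests would call the postprocessor's own code and use real foxnews output.

```python
import re
import unittest


def find_text_aliases(text, aliases):
    """Hypothetical stand-in: return aliases that appear as whole words (case-insensitive)."""
    return [
        alias for alias in aliases
        if re.search(rf"\b{re.escape(alias)}\b", text, flags=re.IGNORECASE)
    ]


def extract_hyperlinks(text):
    """Hypothetical stand-in: return http(s) URLs found in plain text."""
    return re.findall(r"https?://[^\s)>\"]+", text)


class PostprocessedResultTest(unittest.TestCase):
    def test_text_alias_found(self):
        text = "As reported by Fox News on Tuesday."
        self.assertEqual(find_text_aliases(text, ["Fox News", "CNN"]), ["Fox News"])

    def test_alias_not_matched_inside_word(self):
        # "foxnews" embedded in a longer word should not count as a mention
        self.assertEqual(find_text_aliases("a foxnewsy tone", ["foxnews"]), [])

    def test_hyperlinks_extracted(self):
        text = "Read more: https://www.foxnews.com/politics (via @FoxNews)"
        self.assertEqual(extract_hyperlinks(text), ["https://www.foxnews.com/politics"])
```

Saved in a test module, this can be run with `python -m unittest`. Small, fast tests like these could also shorten the 90-minute logging rounds by checking individual cases in isolation.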

Action Items

  • if time over break, begin to look into combining crawl results and eliminating duplicates - Gy
    • start with a small batch of each one - 50-100?
    • move combined results to new folder so that original crawl results are preserved until assured that we have the right ones
    • document protocol used to combine results
  • contact Nat about new kind of error with Arbutus cloud with nytimes re-start - Gy
  • attempt re-start of NYT Archive Mid E (both Graham and Arbutus IPs) - Gy
  • try to restart IA crawl and if it doesn't restart, send email inquiry about being blocked - Ra
  • figure out an estimate of actual articles for NYT and document the protocol for estimating actual articles on IA - Ra
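
For the combine-and-deduplicate action item, a rough sketch of one possible approach follows. It assumes, purely for illustration, that each crawl folder holds one JSON file per article with a `url` field; the function names, the file layout, and the URL normalization rule are all hypothetical and would need to be adapted to the actual crawl output. Files are copied, never moved, so the original crawl results stay untouched.

```python
import json
import shutil
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def combine_crawls(crawl_dirs, out_dir, limit_per_crawl=100):
    """Copy a small batch from each crawl into out_dir, skipping duplicate URLs.

    Returns (kept, duplicates), where duplicates maps each skipped file
    to the copy that was kept for the same normalized URL.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = {}        # normalized URL -> path of the copy that was kept
    duplicates = {}  # skipped source file -> kept copy
    for crawl_dir in crawl_dirs:
        files = sorted(Path(crawl_dir).glob("*.json"))[:limit_per_crawl]
        for src in files:
            with src.open(encoding="utf-8") as fh:
                record = json.load(fh)
            key = normalize_url(record["url"])  # assumed field name
            if key in seen:
                duplicates[str(src)] = str(seen[key])
                continue
            # Prefix with the crawl folder name so file names cannot collide.
            dest = out / f"{Path(crawl_dir).name}__{src.name}"
            shutil.copy2(src, dest)
            seen[key] = dest
    return list(seen.values()), duplicates
```

The returned `duplicates` mapping doubles as a record of what was dropped and why, which could feed directly into the documented protocol. `limit_per_crawl` matches the idea of starting with a batch of 50-100 per crawl.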