Feb 15, 2024 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • continue combining results and documenting (reading week) - Gy
  • monitor NYT archive and send email to Alejandro to update - Gy
  • follow up with IA regarding connection refused error - Ra
  • ask Nat for meeting about connection refused error - Ra
  • try another kind of crawl to see if there's a refused error - Ra
  • try to update version of node to see if that helps - Ra
  • take a sample of Wa/Po and see if the right result can be reproduced - Fr
  • follow up by email about Wa/Po output number - Fr
  • unit-testing each function of the postprocessor for IA dataset - Ar
  • if unit-testing shows accuracy, then request IA Electronic Intifada dataset from Raazia and proceed with postprocessing - Ar

Crawl

  • need at least 2000 results a day on NYT Mid E archive crawl
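
A rough way to check the crawl against the 2000-a-day target is to convert the target into a time budget per result and to project a daily total from a short test run. The sketch below is only illustrative; the function names are hypothetical and are not part of the MediaCAT crawler.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def max_seconds_per_result(results_per_day):
    """Average time budget per result needed to hit a daily target."""
    return SECONDS_PER_DAY / results_per_day


def projected_daily_results(results_so_far, elapsed_seconds):
    """Extrapolate a daily total from a partial run at the observed rate."""
    return results_so_far * SECONDS_PER_DAY / elapsed_seconds


# 2000 results a day leaves about 43.2 seconds per result on average.
budget = max_seconds_per_result(2000)

# Hypothetical test run: 100 results in the first hour projects to 2400 a day.
projection = projected_daily_results(100, 3600)
```

If the projection from a test run stays below 2000, the crawl is not meeting the target at its current request rate.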

Internet Archive

  • Nat helped with workarounds:
    • separate downloading of URLs through CDX from crawling
    • storing failed responses and trying them again later (while continuing through the successful responses)
    • randomizing the pagination attempts
    • filtering after download of URLs: check for duplicate URLs before assigning them to the queue for crawling
  • some of the problem may be the responsiveness of IA servers, so slow down requests
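
The workarounds above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the crawler's actual code: all function names are hypothetical, `fetch` stands in for whatever function requests a page from IA, and "try them again" is read here as putting a failed URL at the back of the queue so the other URLs are processed first.

```python
import random
import time
from collections import deque


def dedupe_urls(urls):
    """Drop blank and duplicate URLs, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        key = url.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def randomized_pages(num_pages, seed=None):
    """Return page indices 0..num_pages-1 in random order."""
    pages = list(range(num_pages))
    random.Random(seed).shuffle(pages)
    return pages


def crawl_with_retries(urls, fetch, max_attempts=3, delay_seconds=0.0):
    """Crawl already-downloaded URLs, re-queueing refused connections.

    `fetch` is a hypothetical callable that returns a response or raises
    ConnectionError (ConnectionRefusedError is a subclass of it).
    """
    queue = deque((url, 1) for url in dedupe_urls(urls))
    results = {}
    failed = []
    while queue:
        url, attempt = queue.popleft()
        try:
            results[url] = fetch(url)
        except ConnectionError:
            if attempt < max_attempts:
                # Retry later, after the URLs still waiting in the queue.
                queue.append((url, attempt + 1))
            else:
                failed.append(url)
        if delay_seconds:
            # Slow down requests in case the IA servers are unresponsive.
            time.sleep(delay_seconds)
    return results, failed
```

In this sketch the URL list is assumed to have been downloaded beforehand (for example, from the CDX step), so the crawl step never touches pagination; `randomized_pages` would only be used in the separate download step.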

Action Items

  • test whether the NYT Mid E Archive crawl can reach 2000 results a day; if it can, continue; if not, abandon - Gy
  • continue combining results and documenting (reading week) - Gy
  • integrate Nat's suggestions and test the NYT Mid E crawl again - Ra