Nov 2, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Postprocessor

  • no errors in processing of scraped results
  • Wa/Po: trying to clean dataset and getting errors
  • IA results: problem with postprocessor expecting CSV not JSON
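
A minimal sketch of the kind of converter that could bridge the JSON/CSV mismatch noted above. It assumes the IA results are a JSON array of flat objects, which these notes do not confirm; the function name `json_records_to_csv` and the sample field names are hypothetical and not taken from the MediaCAT code.

```python
import csv
import io
import json


def json_records_to_csv(json_text, fieldnames=None):
    """Convert a JSON array of flat objects into CSV text.

    Assumption: the input is a list of dicts. Nested values are written
    using their Python string form, which may not suit every consumer.
    """
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects")

    if fieldnames is None:
        # Collect column names in first-seen order across all records,
        # so records with differing keys still share one header.
        fieldnames = []
        for record in records:
            for key in record:
                if key not in fieldnames:
                    fieldnames.append(key)

    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        restval="",              # fill missing keys with an empty cell
        extrasaction="ignore",   # drop keys not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = '[{"url": "https://example.com/a", "title": "A"}]'
    print(json_records_to_csv(sample), end="")
```

Passing an explicit `fieldnames` list would pin the column order to whatever the postprocessor expects, instead of relying on the order in which keys first appear.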

server/crawler

  • finalizing Mondoweiss in IA: 36,147 successful (not landing pages, photos, etc.); about 500 not relevant

Action Items

  • develop unit testing for foxnews postprocessed results, for example, on text alias - Ar
  • Wa/Po twitter data set: look for lines producing errors - Fr
  • look for converter for CSV/JSON - Ar
  • add debugging to IA crawler like total crawled - Ra
  • add documentation about filtering out irrelevant URLs for IA crawler - Ra
  • start crawling electronicintifada and nytimes - Ra
  • sending email to Gy asking about multiple crawlers running at the same time - Ra
  • sending email to Nat about difference b/w URLs and new URLs in archive.org data - Ra