Oct 26, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Crawler/server

  • Graham cloud is working now:
    • one previously blocked domain is still blocked

Internet Archive crawl:

  • trying to collect on Mondoweiss
  • 17,000 snapshots crawled since Monday; only a subset are actual articles
  • Internet Archive guidelines don't mention wait times or similar limits
  • so far no sign of being blocked
  • metascraper still needs to make a crawl, but one call was removed
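
A minimal sketch of how the snapshot and article tally for a crawl like this could be kept. Everything here is an assumption for illustration: the `CrawlCounter` class, the `looks_like_article` rule (a title plus at least 200 characters of text), and the `title`/`text` field names are hypothetical and not taken from the MediaCAT code.

```python
from dataclasses import dataclass


@dataclass
class CrawlCounter:
    """Running tally for a snapshot crawl (hypothetical helper)."""
    snapshots: int = 0
    articles: int = 0

    def record(self, is_article: bool) -> None:
        # Every crawled snapshot is counted; articles are a subset.
        self.snapshots += 1
        if is_article:
            self.articles += 1

    def summary(self) -> str:
        share = self.articles / self.snapshots if self.snapshots else 0.0
        return f"{self.snapshots} snapshots, {self.articles} articles ({share:.1%})"


def looks_like_article(record: dict) -> bool:
    # Assumption: a snapshot counts as an article when it has a title and
    # at least 200 characters of body text. The real rule may differ.
    text = record.get("text") or ""
    return bool(record.get("title")) and len(text) >= 200
```

Calling `counter.record(looks_like_article(record))` once per crawled snapshot and printing `counter.summary()` at the end would give the article share directly, without a separate pass over the results.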

Postprocessor

  • seems to be working
  • need unit tests

Action Items:

  • try other domains on the Graham instance - Gy
  • re-start the small domain crawler on Graham - Gy
  • add counter for IA crawler - Ra
  • see how many were actual articles from Mondoweiss IA crawl - Ra
  • enter Mondoweiss IA Crawl into the Crawl index - Ra
  • on Saturday, create a 200-result set from the Mondoweiss IA crawl and email it to Francisco and Alejandro - Ra
  • unit test for postprocessing - start developing - Ar
  • postprocess Washington Post Twitter results - Fr
  • check whether it is difficult for the postprocessor to accept article length - Ar
  • remove vulnerable files/libraries from archived postprocessor - Fr
  • look at adding article length (not crucial) - Ra
  • check whether the postprocessor applies tags to both citations and citing articles - Fr
  • delete data on the small instance that is running with local storage - Gy
  • check postprocessed result of small data set from Raazia - Fr