March 22, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • catch up

Crawl issues

  • removed tabletmag from the main crawl and stored it separately
  • changed the crawl pause setting; the crawl now runs faster
  • aljazeera.com is still causing some problems, but they are less of an issue than before

Israeli news site crawl

  • started setting up the directory

Postprocessor

  • Twitter crawl (Shengsong): expanding short URLs
    • managed to run a smaller file without errors
    • the issue: the last line of some output files gets cut off, which breaks anything that tries to parse them
    • now trying to run the short-URL expansion on a larger file
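
One way to check for the cut-off last line described above is sketched below. This is only a sketch under assumptions: it assumes the output files are line-oriented UTF-8 text in which every complete record ends with a newline, which these notes do not confirm. The function name `split_complete_lines` and its `path` parameter are hypothetical and are not part of the postprocessor.

```python
def split_complete_lines(path):
    """Hypothetical helper: separate complete lines from a cut-off final line.

    Assumes every complete record ends with a newline character.
    Returns (complete_lines, truncated_tail); truncated_tail is None
    when the file is empty or ends cleanly with a newline.
    """
    with open(path, "rb") as f:
        data = f.read()

    # Splitting on b"\n" leaves an empty final piece when the file ends
    # with a newline, and a non-empty piece when the last line was cut off.
    parts = data.split(b"\n")
    tail = parts.pop()

    # Decode only the complete lines; strip "\r" in case of Windows line endings.
    lines = [p.decode("utf-8", errors="replace").rstrip("\r") for p in parts]
    truncated = tail.decode("utf-8", errors="replace") if tail else None
    return lines, truncated
```

A parser could then process only the complete lines and log the truncated tail, so that one damaged file does not stop the whole run. Note that this detects the symptom only; it does not explain why the writer stops before finishing the last line.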

Action Items

  • Israeli news sites: finish setting up directory and start crawl
  • keep working on postprocessor

For new work studies:

  • figure out why the last line of some output files in the Twitter crawl is being cut off
  • look into the email issue when the crawler breaks