March 22, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
Crawl issues
- removed tabletmag and stored it separately
- tried changing the pause; the crawl is now running faster
- aljazeera.com still giving some problems, but not as big an issue
Israeli news site crawl
- started setting up the directory
Postprocessor
- Twitter crawl - Shengsong: expanding short URLs
- managed to run a smaller file without errors
- the issue: the last line of some output files gets cut off, which causes problems for anything trying to parse them
- now trying to run with a larger file to expand short URLs
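
One way to keep downstream parsing from failing on a cut-off last line is to tolerate a bad final line and flag it, while still failing on bad lines elsewhere. The sketch below is only an illustration: it assumes the output files hold one JSON record per line, which may not match the actual Twitter crawl output, and `parse_json_lines` is a hypothetical helper name, not part of the postprocessor.

```python
import json
from typing import Any, List, Tuple


def parse_json_lines(text: str) -> Tuple[List[Any], bool]:
    """Parse newline-delimited JSON, tolerating a cut-off final line.

    Assumption: one JSON record per line (the real output format may differ).
    Returns (records, truncated); truncated is True when the last non-empty
    line could not be parsed and was skipped.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    records: List[Any] = []
    truncated = False
    for index, line in enumerate(lines):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            if index == len(lines) - 1:
                # Only the final line is forgiven: it was likely cut off mid-write.
                truncated = True
            else:
                # A bad line in the middle is a different problem, so surface it.
                raise
    return records, truncated


if __name__ == "__main__":
    sample = '{"id": 1}\n{"id": 2}\n{"id": 3, "url": "htt'
    records, truncated = parse_json_lines(sample)
    print(len(records), "complete records; last line truncated:", truncated)
```

This only works around the symptom. The flag makes it possible to log which files were affected while the cause of the truncation is still being investigated.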
Action Items
- Israeli news sites: finish setting up directory and start crawl
- keep working on postprocessor
For new work studies:
- figure out why the last line of the output file in the Twitter crawl is being cut off.
- issue of email when the crawler breaks