Oct 26, 2023 - UTMediaCAT/mediacat-docs GitHub Wiki
Crawler/server
- Graham cloud is working now:
- one previously blocked domain is still blocked
Internet Archive crawl:
- trying to collect on Mondoweiss
- 17,000 snapshots crawled since Monday; only a subset are actual articles
- Internet Archive guidelines don't mention any required wait times or similar limits
- so far no sign of being blocked
- metascraper needs to make a crawl, but got rid of one call
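A rough sketch of how the snapshot count and the "actual articles" subset could be tallied during the IA crawl, with a self-imposed pause between requests. All of this is hypothetical: `CrawlCounter`, `crawl`, `fetch`, `looks_like_article` and `delay_seconds` are illustrative names, not part of the existing crawler, and the delay value is an assumption since the guidelines give no wait time.

```python
import time
from dataclasses import dataclass


@dataclass
class CrawlCounter:
    """Running tally for a snapshot crawl (hypothetical helper, not MediaCAT code)."""
    snapshots: int = 0
    articles: int = 0

    def record(self, is_article: bool) -> None:
        # Every fetched snapshot counts; only some of them are real articles.
        self.snapshots += 1
        if is_article:
            self.articles += 1

    def summary(self) -> str:
        return f"{self.articles} articles out of {self.snapshots} snapshots"


def crawl(snapshot_urls, fetch, looks_like_article, delay_seconds=1.0):
    """Fetch each snapshot in turn, pausing between requests, and count results.

    fetch and looks_like_article are supplied by the caller (assumed interfaces).
    """
    counter = CrawlCounter()
    for i, url in enumerate(snapshot_urls):
        if i > 0:
            # Self-imposed pause between requests; the value is an assumption.
            time.sleep(delay_seconds)
        page = fetch(url)
        counter.record(looks_like_article(page))
    return counter
```

Calling `crawl(...).summary()` at the end of a run would give the figure needed for the Crawl index entry without a separate pass over the results.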
Postprocessor
- seems to be working
- need
- unit test:
*
Action Items:
- try other domains on the Graham instance - Gy
- restart the small domain crawler on Graham - Gy
- add counter for IA crawler - Ra
- see how many were actual articles from Mondoweiss IA crawl - Ra
- enter Mondoweiss IA Crawl into the Crawl index - Ra
- on Saturday, create a 200-item result set from the Mondoweiss IA crawl and email it to Francisco and Alejandro - Ra
- unit test for postprocessing - start developing - Ar
- postprocess Washington Post Twitter results - Fr
- check whether it would be difficult for the postprocessor to accept article length - Ar
- remove vulnerable files/libraries from archived postprocessor - Fr
- look at adding article length (not crucial) - Ra
- check if postprocessor applies tags to citations and citing articles - Fr
- delete data on small instance that's running with local storage - Gy
- check postprocessed result of small data set from Raazia - Fr