July 08, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Domain Crawler
- Twitter Crawl
- Compute Canada domain question
Twitter Crawl
- Some of the handles didn't get crawled
- we can do a simple diff to see which handles didn't get crawled; Raiyan will send a Google Sheet to Kirsta & Alejandro
- Alejandro & Kirsta will go through failed ones to see why they didn't work
- the crawl doesn't seem to have stopped because of the error; rather, it seems to have completed
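
The "simple diff" mentioned above could be sketched roughly as follows. This is only an illustration: the function name `find_uncrawled` and the sample handles are hypothetical, and it assumes both lists are available as plain strings. Handles are compared case-insensitively and without a leading `@`.

```python
def find_uncrawled(requested, crawled):
    """Return the requested handles that do not appear in the crawled list.

    Input order is preserved and duplicates are reported once.
    """
    def norm(handle):
        # Normalise so "@Alpha", "alpha " and "ALPHA" compare as equal.
        return handle.strip().lstrip("@").lower()

    crawled_set = {norm(h) for h in crawled}
    seen = set()
    missing = []
    for handle in requested:
        key = norm(handle)
        if key and key not in crawled_set and key not in seen:
            seen.add(key)
            missing.append(handle.strip())
    return missing


# Hypothetical sample data, not real handles from the crawl.
requested = ["@Alpha", "beta", "Gamma "]
crawled = ["alpha", "GAMMA"]
print(find_uncrawled(requested, crawled))  # ['beta']
```

The resulting list is what would go into the sheet for manual review of the failed handles.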
Domain Crawler
- Raiyan created a script named "master crawler" that runs the crawler and controls how often it resets
- it is crawling at a steady rate of about 13.5 links per minute, which is much better than before:
- in 13 hours, 10,000+ links were crawled without errors or memory issues
- need some in-code documentation with the new changes
- and then higher level documentation for the MVP
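
As a quick sanity check, the two figures reported above (13.5 links per minute, 10,000+ links in 13 hours) are consistent with each other. The helper name `expected_links` is made up for this sketch.

```python
def expected_links(links_per_minute, hours):
    """Number of links crawled at a constant rate over the given number of hours."""
    return links_per_minute * 60 * hours


print(expected_links(13.5, 13))  # 10530.0
```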
Domain Crawler Log Files
- log files are now renamed to the timestamp of when they finish, then moved to the logs folder
- this way they don't overwrite each other
- very important to keep these in mind for troubleshooting
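
A minimal sketch of the log handling described above, not the actual crawler code. The function name `archive_log`, the timestamp format, and the file names are assumptions. The numeric suffix is an extra safeguard added here for two logs that finish in the same second.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_log(log_path, logs_dir, finished_at=None):
    """Rename a finished log to its finish timestamp and move it into logs_dir."""
    log_path = Path(log_path)
    logs_dir = Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)

    finished_at = finished_at or datetime.now()
    stamp = finished_at.strftime("%Y-%m-%d_%H-%M-%S")  # assumed format
    target = logs_dir / f"{stamp}{log_path.suffix}"

    # Never overwrite an existing log: add a counter if the name is taken.
    counter = 1
    while target.exists():
        target = logs_dir / f"{stamp}_{counter}{log_path.suffix}"
        counter += 1

    shutil.move(str(log_path), str(target))
    return target


# Demonstration in a throwaway directory with a fixed finish time.
with tempfile.TemporaryDirectory() as tmp:
    finished = datetime(2021, 7, 8, 12, 0, 0)
    for _ in range(2):
        current = Path(tmp) / "crawl.log"  # hypothetical log file name
        current.write_text("crawl output\n")
        archive_log(current, Path(tmp) / "logs", finished_at=finished)
    archived_names = sorted(p.name for p in (Path(tmp) / "logs").iterdir())

print(archived_names)
```

Both logs survive in the logs folder, which is the property that matters for troubleshooting.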
Compute Canada
- one IP they mentioned wasn't actually ours
- the operating system was updated: some things broke and needed to be upgraded
- it might be worthwhile to look at Apify 1.0 and upgrade our code to match; need to consider this before Raiyan leaves
Problem of Folders That Have Proliferated
- need to consider deprecating
- need to remove folders and data to free up storage
- Alejandro will schedule a Zoom call with Raiyan to decide what to delete
Raiyan's priorities for next week:
- Twitter handle problem, then documentation and pushing to master, and then considering how to start more than one instance on one server