July 08, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

Domain Crawler
Twitter Crawl
Compute Canada domain question

Twitter Crawl

Some of the handles didn't get crawled
we can run a simple diff to see which handles didn't get crawled; Raiyan will send a Google Sheet to Kirsta & Alejandro
Alejandro & Kirsta will go through failed ones to see why they didn't work
the crawl doesn't seem to have stopped because of an error; rather, it appears to have run to completion
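
The simple diff mentioned above could be sketched as a set difference in Python. This is only a minimal illustration, not the project's script: the function name `find_uncrawled` and the sample handles are hypothetical, and it assumes handles should be compared case-insensitively with any leading `@` ignored.

```python
def normalize(handle):
    """Lower-case a handle and drop a leading '@' so comparisons are consistent."""
    return handle.strip().lstrip("@").lower()


def find_uncrawled(requested, crawled):
    """Return requested handles with no match among crawled ones, in input order."""
    crawled_set = {normalize(h) for h in crawled}
    missing = []
    seen = set()
    for handle in requested:
        key = normalize(handle)
        # Skip blanks, handles that were crawled, and repeats in the input list.
        if key and key not in crawled_set and key not in seen:
            seen.add(key)
            missing.append(key)
    return missing


# Hypothetical sample data, for illustration only.
requested = ["@ExampleOne", "exampletwo", "ExampleThree"]
crawled = ["exampleone", "@EXAMPLETHREE"]
print(find_uncrawled(requested, crawled))  # prints ['exampletwo']
```

The resulting list is the kind of thing that could be pasted into the shared sheet for manual review.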

Domain Crawler

Raiyan created a script named "master crawler" that runs the crawler and controls how often it resets
crawling at a steady 13.5 links per minute, and performing much better:
in 13 hours, 10,000+ links crawled without errors or memory issues (13.5 links/min × 780 min ≈ 10,530)
need some in-code documentation for the new changes
and then higher-level documentation for the MVP
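
The notes do not show the "master crawler" script itself. As a rough illustration of the general idea (run the crawler, then reset it on a fixed interval), a supervisor loop might look like the sketch below. The function name `run_with_resets`, its parameters, and the placeholder command are all assumptions made for this example, not the actual script.

```python
import subprocess
import sys


def run_with_resets(cmd, reset_after_s, max_runs):
    """Run cmd repeatedly, stopping each run after reset_after_s seconds.

    Returns one entry per run: the exit code, or None if the run was
    stopped because it reached the reset interval.
    """
    results = []
    for _ in range(max_runs):
        try:
            done = subprocess.run(cmd, timeout=reset_after_s)
            results.append(done.returncode)
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child process on timeout; record the reset.
            results.append(None)
    return results


if __name__ == "__main__":
    # Placeholder command standing in for the real crawler invocation.
    placeholder = [sys.executable, "-c", "print('crawling...')"]
    print(run_with_resets(placeholder, reset_after_s=30, max_runs=2))
```

Restarting on a timer is one simple way to keep memory use bounded in a long crawl; whether the real script works this way is not stated in these notes.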

Domain Crawler Log Files

log files are renamed with the timestamp of when they finish and moved to the logs folder
so they no longer overwrite each other
very important to keep these in mind for troubleshooting
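
The log handling described above (rename a finished log to its finish timestamp, move it into a logs folder, never overwrite an earlier log) could be sketched as follows. The function name `archive_log`, the timestamp format, and the counter suffix for same-second collisions are assumptions for illustration; the crawler's real naming scheme is not given in these notes.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_log(log_path, logs_dir, finished_at=None):
    """Move log_path into logs_dir under a name built from the finish time."""
    log_path = Path(log_path)
    logs_dir = Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)
    finished_at = finished_at or datetime.now()
    stamp = finished_at.strftime("%Y-%m-%d_%H-%M-%S")
    target = logs_dir / f"{stamp}{log_path.suffix}"
    # If two runs finish within the same second, add a counter
    # so that neither log overwrites the other.
    counter = 1
    while target.exists():
        target = logs_dir / f"{stamp}_{counter}{log_path.suffix}"
        counter += 1
    shutil.move(str(log_path), str(target))
    return target
```

Calling `archive_log("debug.log", "logs")` after a run finishes would leave one timestamped file per run in `logs/`, which keeps the full history available for troubleshooting.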

Compute Canada

one of the IPs they mentioned wasn't actually ours
the operating system was updated: some things broke and needed to be upgraded
might be worthwhile to look at Apify 1.0 and upgrade our code to match; need to consider this before Raiyan leaves

Problem of Proliferating Folders

need to consider deprecating
need to remove folders and data to free up storage
Alejandro will schedule a Zoom call with Raiyan to decide what to delete
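
To support the deletion decisions above, it may help to see which folders take up the most space. The sketch below is one possible way to list immediate subfolders by total size; the function name `folder_sizes` is hypothetical, and it only reports sizes without deleting anything.

```python
import os
from pathlib import Path


def folder_sizes(root):
    """Return (size_in_bytes, path) for each immediate subfolder, largest first."""
    sizes = []
    for child in Path(root).iterdir():
        if not child.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(child):
            for name in filenames:
                file_path = os.path.join(dirpath, name)
                # Skip symlinks so linked files are not counted.
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
        sizes.append((total, str(child)))
    return sorted(sizes, reverse=True)
```

For example, `for size, path in folder_sizes("."): print(size, path)` prints the largest folders first.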

Raiyan's priorities for next week:

Twitter handle problem, documentation and pushing to master, and then considering how to run more than one instance on one server