July 08, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Domain Crawler
  • Twitter Crawl
  • Compute Canada domain question

Twitter Crawl

  • Some of the handles didn't get crawled
  • we can do a simple diff to see which handles didn't get crawled; Raiyan will send a Google Sheet to Kirsta & Alejandro
  • Alejandro & Kirsta will go through the failed handles to see why they didn't work
  • the crawl doesn't seem to have stopped because of an error; rather, it appears to have run to completion
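
The "simple diff" mentioned above could be sketched as a set difference between the handles that were requested and the handles that appear in the crawl output. This is only a minimal sketch, assuming both lists are available as plain lists of handle strings; the function name `find_uncrawled` and the sample handles are hypothetical and not part of the MediaCAT code.

```python
def find_uncrawled(requested, crawled):
    """Return requested handles that are missing from the crawl output.

    Handles are compared case-insensitively and without a leading "@",
    since Twitter handles are case-insensitive.
    """
    def normalize(handle):
        return handle.strip().lstrip("@").lower()

    crawled_set = {normalize(h) for h in crawled}
    # Keep the order of the requested list so the result lines up
    # with the original spreadsheet.
    return [h for h in requested if normalize(h) not in crawled_set]


# Hypothetical sample data, for illustration only.
requested = ["@ExampleOne", "example_two", "@example_three"]
crawled = ["exampleone", "@Example_Three"]
print(find_uncrawled(requested, crawled))  # ['example_two']
```

The resulting list is what would go into the shared sheet for manual review.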

Domain Crawler

  • Raiyan created a script named "master crawler" to run the crawler, including how often it resets
  • it is crawling at a steady 13.5 links per minute, and it is crawling much better:
  • in 13 hours, 10,000+ links were crawled without errors or memory issues
  • need some in-code documentation for the new changes
  • and then higher-level documentation for the MVP
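
These notes don't describe how the "master crawler" script is implemented. The general idea of running a long-lived process and resetting it on a fixed schedule might look something like the sketch below. Everything here is an assumption made for illustration: the function name `run_with_restarts`, its parameters, and the placeholder command, which stands in for the real crawler invocation.

```python
import subprocess
import sys
import time


def run_with_restarts(command, run_seconds, max_runs, pause_seconds=0.0):
    """Run `command` repeatedly, stopping each run after `run_seconds`.

    Returns one entry per run: the exit code if the process ended on
    its own, or None if it was stopped at the time limit.
    """
    results = []
    for _ in range(max_runs):
        process = subprocess.Popen(command)
        try:
            results.append(process.wait(timeout=run_seconds))
        except subprocess.TimeoutExpired:
            # Time limit reached: stop this run so the next one starts fresh.
            process.terminate()
            process.wait()
            results.append(None)
        time.sleep(pause_seconds)
    return results


# Placeholder command; the real crawler invocation is not given in these notes.
placeholder = [sys.executable, "-c", "print('crawl batch done')"]
print(run_with_restarts(placeholder, run_seconds=30, max_runs=2))
```

Restarting on a timer is one common way to keep memory use bounded in a long crawl, which may be why the reset interval is part of the script.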

Domain Crawler Log Files

  • log files are renamed with the timestamp of when they finish and moved to the logs folder
  • so they don't erase each other
  • very important to keep these in mind for troubleshooting
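
The renaming scheme above can be illustrated with a short sketch. It assumes a single log file per crawl and a timestamp format of `YYYY-MM-DD_HH-MM-SS`; the actual format and file names used by the crawler are not stated in these notes, and `archive_log` is a hypothetical helper.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def archive_log(log_path, logs_dir, finished_at=None):
    """Move a finished log into `logs_dir`, named by its finish time."""
    log_path = Path(log_path)
    logs_dir = Path(logs_dir)
    logs_dir.mkdir(parents=True, exist_ok=True)

    finished_at = finished_at or datetime.now()
    stem = finished_at.strftime("%Y-%m-%d_%H-%M-%S")
    target = logs_dir / f"{stem}{log_path.suffix}"

    # If two logs finish within the same second, add a counter so that
    # neither file overwrites the other.
    counter = 1
    while target.exists():
        target = logs_dir / f"{stem}_{counter}{log_path.suffix}"
        counter += 1

    shutil.move(str(log_path), str(target))
    return target


# Demonstration in a temporary directory with a made-up log file.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "crawl.log"
    log.write_text("example log line\n")
    archived = archive_log(log, Path(tmp) / "logs", datetime(2021, 7, 8, 12, 0, 0))
    print(archived.name)  # 2021-07-08_12-00-00.log
```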

Compute Canada

  • one IP they mentioned wasn't actually ours
  • the operating system was updated: some things broke and needed to be upgraded
  • it might be worthwhile to look at Apify 1.0 and upgrade our code to match; this needs to be considered before Raiyan leaves

Problem of Proliferating Folders

  • need to consider deprecating unused folders
  • need to remove folders and data to free up storage
  • Alejandro will schedule a Zoom call with Raiyan to make decisions about deletion
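
To support the deletion decisions, it may help to see which folders take up the most space. The sketch below is one possible way to list the immediate subfolders of a directory by total size. It is not an existing project script: `folder_sizes` and the sample folder names are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def folder_sizes(root):
    """Return (folder name, total bytes) for each immediate subfolder
    of `root`, largest first."""
    sizes = []
    for entry in Path(root).iterdir():
        if not entry.is_dir():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                file_path = Path(dirpath) / name
                # Skip symlinks so linked data is not counted twice.
                if not file_path.is_symlink():
                    total += file_path.stat().st_size
        sizes.append((entry.name, total))
    return sorted(sizes, key=lambda item: item[1], reverse=True)


# Demonstration with made-up folders in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "old_results").mkdir()
    (Path(tmp) / "old_results" / "data.json").write_bytes(b"x" * 100)
    (Path(tmp) / "scratch").mkdir()
    (Path(tmp) / "scratch" / "notes.txt").write_bytes(b"x" * 10)
    for name, size in folder_sizes(tmp):
        print(f"{size:>10} bytes  {name}")
```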

Raiyan's priorities for next week:

  • the Twitter handle problem, documentation and pushing to master, and then considering how to run more than one instance on one server