January 6, 2022 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

On-boarding Shengsong

  • Shengsong will work through the Compute Canada security issues as a way to get to know the resources:
    • If you have not applied OS updates recently, make time to schedule an outage of your instance to apply operating system and application updates.
    • Review your security group rules and lock down access to services to as few remote IP prefixes as possible.
    • Delete cloud instances you are no longer using/maintaining as those pose a security risk and are consuming valuable resources.
  • use this as an opportunity to map out all our resources on Compute Canada
  • create a new instance, install the software, and crawl nytimes.com
    • run the latest code against nytimes.com to benchmark the speed of the current crawler
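
For the security group review above, a minimal sketch of how overly open rules could be flagged. The rule records, their field names (`port`, `remote_ip_prefix`), and the `overly_broad` helper are all hypothetical and not part of our tooling; on a real instance the rules would have to be exported from the cloud dashboard or CLI first. The threshold is tuned for IPv4 prefixes only.

```python
import ipaddress

# Hypothetical rule records for illustration only (documentation/private ranges).
RULES = [
    {"port": 22, "remote_ip_prefix": "0.0.0.0/0"},
    {"port": 443, "remote_ip_prefix": "0.0.0.0/0"},
    {"port": 5432, "remote_ip_prefix": "10.0.0.0/8"},
    {"port": 8000, "remote_ip_prefix": "203.0.113.7/32"},
]


def overly_broad(rules, min_prefix_len=24):
    """Return rules whose remote prefix is wider than /min_prefix_len.

    A shorter prefix length means more remote addresses are allowed in.
    Assumption: IPv4 prefixes; an IPv6 threshold would need to differ.
    """
    flagged = []
    for rule in rules:
        net = ipaddress.ip_network(rule["remote_ip_prefix"], strict=False)
        if net.prefixlen < min_prefix_len:
            flagged.append(rule)
    return flagged


if __name__ == "__main__":
    for rule in overly_broad(RULES):
        print(f"port {rule['port']} is open to {rule['remote_ip_prefix']}")
```

Anything the sketch flags is only a candidate for tightening; some services (for example a public web port) may legitimately stay open to all addresses.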

looking at results:

  • best solution is to set up JupyterHub on Compute Canada
  • Colin: port scripts from the existing notebook into the hub
  • with the hub in place, we would not need to optimize CSV preparation

issues with al-monitor.com results:

  • need to do forensics to figure out errors
  • we will ask Shengsong to look at this once he is fully up and running on Compute Canada

Action Items:

  • Shengsong: Compute Canada updates and mapping
  • Shengsong: setting up an instance of the domain crawler for nytimes.com
  • Colin: setting up JupyterHub on our resources
  • next steps for Shengsong:
    • looking to see if Twint issues are resolved and a crawl can be done
    • forensics on issues with al-monitor crawl (see link above)