May 13, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Next Week: Onboarding John (KS won't be at meeting)

  • Progress on the Crawler (Raiyan): Raiyan got the batch crawler running. It takes the full scope and, optionally, the number of pages per "round"; it then goes through all the domains and, by default, crawls five pages from each. It creates a separate queue for each domain, and each queue handles only that domain. Each domain is given an equal amount of time to crawl, which lets us see whether any domains are problem domains where only a single page might be crawled. Raiyan also managed to save state, so that if we stop the crawler in the middle we can pick up where we left off. Once that was done, he refactored much of the batch crawler to make it more modular and easier to read. The domain crawler contains the batch crawler. It should be deployable by the end of the week. The PDF crawler has been implemented. It is desirable to have PDFs of hits only, and this might require additional work. We will not use PDF crawling in the next phase of this project. Raiyan believes we may be able to create PDFs conditionally, but that is next-phase development and not for this phase.

  • Game plan for summer (set up multiple instances for crawler?)

What we have: two large servers associated with two projects on Graham cloud, three small testing servers on Arbutus associated with one project, and three UTSC servers.
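
For anyone picking up the crawler, the round-based behaviour described under "Progress on the Crawler" can be sketched roughly as below. This is an illustrative sketch only, not the project's actual code: the function names (`build_queues`, `run_round`, `save_state`, `load_state`), the JSON state file, and the injected `fetch` callable are all assumptions made for the example, and the per-round budget is modelled as a page count rather than a time limit.

```python
import json
from collections import deque
from pathlib import Path
from urllib.parse import urlparse

# The notes mention five pages per round as the default.
DEFAULT_PAGES_PER_ROUND = 5


def build_queues(scope):
    """Group the seed URLs in the scope into one queue per domain."""
    queues = {}
    for url in scope:
        domain = urlparse(url).netloc
        queues.setdefault(domain, deque()).append(url)
    return queues


def run_round(queues, visited, fetch, pages_per_round=DEFAULT_PAGES_PER_ROUND):
    """Crawl up to pages_per_round new pages from every domain queue.

    fetch is a hypothetical callable: fetch(url) -> list of links on the page.
    Returns {domain: number of pages crawled in this round}.
    """
    crawled = {}
    for domain, queue in queues.items():
        count = 0
        while queue and count < pages_per_round:
            url = queue.popleft()
            if url in visited:
                continue  # already crawled, does not use up the budget
            visited.add(url)
            for link in fetch(url):
                # Each queue only ever holds links from its own domain.
                if urlparse(link).netloc == domain and link not in visited:
                    queue.append(link)
            count += 1
        crawled[domain] = count
    return crawled


def save_state(path, queues, visited):
    """Write the queues and the visited set to disk so a crawl can resume."""
    data = {
        "queues": {domain: list(queue) for domain, queue in queues.items()},
        "visited": sorted(visited),
    }
    Path(path).write_text(json.dumps(data))


def load_state(path):
    """Restore the queues and the visited set written by save_state."""
    data = json.loads(Path(path).read_text())
    queues = {domain: deque(urls) for domain, urls in data["queues"].items()}
    return queues, set(data["visited"])
```

In a sketch like this, the per-round counts returned by `run_round` are what would expose a problem domain: a domain that keeps reporting zero or one page while others fill their budget is worth a closer look.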

Next Steps for Crawler:

  1. Raiyan will merge and tidy repositories/docs and then tell Kirsta so that she knows it's ready to pass to John.
  2. Raiyan will install the completed code (this phase) on an Arbutus server and start running it so we can benchmark performance.
  3. Next meeting: Decide how to run the code on Graham cloud.

Next Steps for Jacqueline

  1. Sort out the security issue on the main project website.
  2. Prep for onboarding
  3. Help Raiyan/John with questions
  4. If there is nothing else to do: NYT API?
  • Meeting Alejandro and Kirsta re: students and partnerships with knowledgeproject? (Scheduled for tomorrow)