January 21, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

  • Crawler performance investigation
    • Jacqueline wrote tests and shared with Raiyan
    • Investigation of 1 count in certain URLs
      • Unsure if this could be due to structure of Cheerio or Puppeteer
      • URLs are being found successfully, but it seems they are not being added to the queue; manual queueing is being considered
      • Certain domains were http rather than https
    • When checking whether a link found on a crawled page is in scope, the check should ignore the scheme (http vs. https)
    • For verification of URLs, most websites will likely be https
    • Raiyan will make the change to the https verification and restart the crawl
  • Raiyan looked at selectively rendering with Puppeteer, and can successfully block images now (still working on removing videos)
    • After incorporating this selection, speed can then be re-tested
  • Jacqueline found that some sites that work with Puppeteer don't work with Cheerio
  • Alex made a PR for integrating metascraper date data
  • Post-processor run attempted
    • Clarification of order and execution of post-processor and mini-processor on Twitter data needed
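
The scope-check point above (ignoring http vs. https when deciding whether a found link is in scope) could look roughly like the sketch below. This is not the crawler's actual code: `schemelessKey` and `inScope` are hypothetical names, and the sketch assumes links have already been resolved to absolute URLs and that scope entries are plain site URLs. It also treats `www.example.com` and `example.com` as different hosts, which may or may not be what the crawler wants.

```typescript
// Hypothetical helpers; the real crawler's scope logic may differ.

/** Reduce an http(s) URL to a scheme-independent key: host plus path. */
function schemelessKey(rawUrl: string): string | null {
  let u: URL;
  try {
    u = new URL(rawUrl); // throws on relative or malformed input
  } catch {
    return null;
  }
  // Only web links can be in scope (skips mailto:, javascript:, etc.).
  if (u.protocol !== "http:" && u.protocol !== "https:") return null;
  // Drop trailing slashes so "/news/" and "/news" compare equal.
  return u.host + u.pathname.replace(/\/+$/, "");
}

/** True when `link` falls under any scope entry, regardless of scheme. */
function inScope(link: string, scope: string[]): boolean {
  const linkKey = schemelessKey(link);
  if (linkKey === null) return false;
  return scope.some((entry) => {
    const scopeKey = schemelessKey(entry);
    // Match the entry itself or anything below it on a path boundary,
    // so "example.com.evil.org" does not match "example.com".
    return (
      scopeKey !== null &&
      (linkKey === scopeKey || linkKey.startsWith(scopeKey + "/"))
    );
  });
}
```

With this shape, a scope entry recorded as `https://example.com` would still accept a link found as `http://example.com/news/a`, which is the behaviour the note asks for.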
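
For the selective-rendering item, one common way to skip images (and possibly videos) in Puppeteer is request interception. The sketch below is an illustration only, not Raiyan's implementation: `blockHeavyResources`, `shouldBlock`, and the block list are assumptions, and the two interfaces are minimal stand-ins for the Puppeteer page and request objects so the snippet does not need to import the library. Blocking the `media` resource type may not remove every video (for example, players embedded through iframes or scripts).

```typescript
// Minimal structural types covering only the Puppeteer calls used here.
interface InterceptedRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}
interface InterceptablePage {
  setRequestInterception(enabled: boolean): Promise<void>;
  on(event: "request", handler: (req: InterceptedRequest) => void): unknown;
}

// Assumed block list: "image" and "media" (audio/video) are resource
// types reported by Puppeteer's request.resourceType().
const BLOCKED_TYPES = new Set<string>(["image", "media"]);

/** Decide whether a request of the given resource type should be dropped. */
function shouldBlock(resourceType: string): boolean {
  return BLOCKED_TYPES.has(resourceType);
}

/** Turn on interception and abort requests for blocked resource types. */
async function blockHeavyResources(page: InterceptablePage): Promise<void> {
  await page.setRequestInterception(true);
  page.on("request", (req) => {
    // Every intercepted request must be either aborted or continued,
    // otherwise the page load stalls.
    if (shouldBlock(req.resourceType())) {
      void req.abort();
    } else {
      void req.continue();
    }
  });
}
```

Calling `blockHeavyResources(page)` before `page.goto(...)` would apply the filter to that page, after which crawl speed can be compared against an unfiltered run.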