January 21, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

Crawler performance investigation
- Jacqueline wrote tests and shared with Raiyan
- Investigation of 1 count in certain URLs
  - Unsure if this could be due to structure of Cheerio or Puppeteer
  - URLs are being found successfully, but they do not seem to be added to the queue; manual queueing is being considered
  - Certain domains were http rather than https
- When checking whether a link found on a crawled page is in scope, the check should ignore the protocol (http vs. https)
- For verification of URLs, most websites will likely be https
- Raiyan will make the change verifying the https header and restart the crawl
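
The protocol-independent scope check discussed above could look roughly like the sketch below. It is only an illustration: `stripScheme` and `isInScope` are hypothetical names, not functions from the crawler, and dropping trailing slashes is an added assumption.

```typescript
// Hypothetical helper: reduce a URL to "host + path" so that the http://
// and https:// variants of the same page compare equal.
// Note: new URL() throws a TypeError on strings that are not absolute URLs.
function stripScheme(rawUrl: string): string {
  const url = new URL(rawUrl);
  // Assumption: a trailing slash should not affect the comparison.
  const path = url.pathname.replace(/\/+$/, "");
  return url.hostname + path;
}

// Hypothetical scope check: a link is in scope when its scheme-less form
// equals, or sits underneath, the scheme-less form of a scope URL.
function isInScope(link: string, scopeUrls: string[]): boolean {
  const target = stripScheme(link);
  return scopeUrls.some((scope) => {
    const base = stripScheme(scope);
    return target === base || target.startsWith(base + "/");
  });
}

// Usage: the scope list uses https, the discovered link uses http.
const scope = ["https://example.com/news"];
console.log(isInScope("http://example.com/news/story", scope)); // true
console.log(isInScope("https://example.com/newsletter", scope)); // false
```

Comparing on host and path only means a domain listed as https still matches links that the site serves over http.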
Raiyan looked at selectively rendering with Puppeteer, and can successfully block images now (still working on removing videos)
- After incorporating this selection, speed can then be re-tested
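
For reference, selective rendering in Puppeteer is usually done with request interception. The sketch below is a minimal illustration, not Raiyan's implementation. The interfaces mirror only the Puppeteer methods used (`setRequestInterception`, `on("request")`, `resourceType`, `abort`, `continue`) so the example does not need the package installed. Including `"media"` in the blocked set is an assumption about how videos might be handled, which the notes list as still in progress.

```typescript
// Minimal structural types covering the parts of Puppeteer's page and
// request objects that this sketch relies on.
interface InterceptedRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

interface InterceptablePage {
  setRequestInterception(enabled: boolean): Promise<void>;
  on(event: "request", handler: (request: InterceptedRequest) => void): unknown;
}

// "image" and "media" are Puppeteer resource types. Blocking "media"
// (audio/video) is an assumption, not something confirmed in the notes.
const BLOCKED_TYPES = new Set<string>(["image", "media"]);

// Abort requests for blocked resource types and let everything else through.
function handleRequest(request: InterceptedRequest): void {
  if (BLOCKED_TYPES.has(request.resourceType())) {
    void request.abort();
  } else {
    void request.continue();
  }
}

// Hypothetical helper: enable interception on a page, then register the
// handler so images and media are never downloaded.
async function blockHeavyResources(page: InterceptablePage): Promise<void> {
  await page.setRequestInterception(true);
  page.on("request", handleRequest);
}
```

With a real Puppeteer page, `blockHeavyResources(page)` would be called before `page.goto(...)` so that interception is active for the first request.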
Jacqueline found that some sites that work with Puppeteer don't work with Cheerio
Alex made a PR for integrating metascraper date data
Post-processor run attempted
- Clarification of order and execution of post-processor and mini-processor on Twitter data needed