January 05, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

  • Crawler speed performance issues
    • Cheerio is faster than Puppeteer, but it does not render JavaScript and is visiting unexpected URLs
    • The Apify SDK is currently used with Puppeteer pools. Using Puppeteer directly could be an option, but we anticipate it would have the same performance issues. Since rendering is slow, we can look into rendering only selected portions of the JavaScript
    • The MediaCAT crawl with Puppeteer is to be restarted with changed maximum and minimum concurrency settings to see whether this improves performance
    • Troubleshooting with Cheerio to be continued
  • Metascraper crawl for dates
    • The date retrieved is pushed to the database
    • Next step is creating a sample set of article links to test what metascraper is able to retrieve
    • Before this, we are focusing on the post-processor and the crawler
  • Some domains return only one or a few links
    • These counts are understood to be the links crawled rather than the links discovered, so the low numbers are likely due to our large scope
    • Still being investigated
  • MediaCAT domain crawler
    • Revision to replace the single large JSON with several smaller JSONs is complete
  • Post-processor framework
    • Benchmarking was not completed on the whole dataset, so it is still unclear how long post-processing will take
  • New repository with API-based code completed and merged
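
As an illustration of the domain crawler revision noted above (several smaller JSONs instead of one large JSON), the following is a minimal sketch of splitting a list of crawled records into fixed-size JSON documents. It is not the crawler's actual code: the function names `chunk_records` and `to_json_chunks`, the `url` field, and the chunk size are all hypothetical and chosen only for the example.

```python
import json
from typing import Dict, Iterator, List


def chunk_records(records: List[Dict], chunk_size: int) -> Iterator[List[Dict]]:
    """Yield consecutive slices of `records`, each holding at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def to_json_chunks(records: List[Dict], chunk_size: int) -> List[str]:
    """Serialize each slice as its own JSON document instead of one large one."""
    return [json.dumps(chunk) for chunk in chunk_records(records, chunk_size)]


# Hypothetical crawl output: five records, split into documents of at most two.
records = [{"url": f"https://example.com/article/{i}"} for i in range(5)]
chunks = to_json_chunks(records, chunk_size=2)
```

Each element of `chunks` can then be written out or processed on its own, so a consumer never has to load the full crawl output at once.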