January 14, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

  • Crawler performance
    • Puppeteer was restarted with max concurrency, but this doesn't seem to address the performance issue
      • Jacqueline is writing tests to identify why we are seeing this slow performance from Puppeteer
    • Jacqueline and Raiyan got Cheerio working, and it retrieves the relevant data despite the lack of JavaScript rendering
      • Cheerio is running on a test instance; in one day it processed 13,000 links and then stopped on its own
      • Cheerio does not use a headless browser, which could be an issue with sites that block non-browser traffic
    • Raiyan and Jacqueline will continue to investigate, using a sample of the scope that includes sites returning a count of 1
    • Currently we are using Puppeteer through the Apify SDK; choosing which media not to render would require working with Puppeteer directly, without the Apify SDK
  • Metascraper
    • Alex added metascraper data columns, and created corresponding tests
  • Mediacat Domain Crawler PR merged
  • Post-processor
    • Amy added logic to interest output for sorting
    • Amy to run the post-processor on the whole Twitter output
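
As a rough illustration of the "selecting media to not render" idea from the crawler discussion, the sketch below shows one way requests could be filtered when driving Puppeteer directly. It is not the project's crawler code. The names `InterceptedRequest`, `SKIPPED_TYPES`, `shouldSkip`, and `routeRequest` are hypothetical, and the list of skipped resource types is an assumption. The only Puppeteer calls it relies on are `page.setRequestInterception`, `page.on("request", ...)`, and the request methods `resourceType()`, `abort()`, and `continue()`.

```typescript
// Minimal shape of the request object this sketch relies on. Puppeteer's
// request objects provide these three methods; declaring them locally keeps
// the example self-contained and testable without launching a browser.
interface InterceptedRequest {
  resourceType(): string;
  abort(): Promise<void>;
  continue(): Promise<void>;
}

// Hypothetical list of resource types to skip (an assumption, not a
// project setting); adjust to what the crawl actually needs.
const SKIPPED_TYPES: ReadonlySet<string> = new Set(["image", "media", "font"]);

// Decide whether a request of the given resource type should be dropped.
function shouldSkip(resourceType: string): boolean {
  return SKIPPED_TYPES.has(resourceType);
}

// Abort skipped resource types and let everything else through.
function routeRequest(request: InterceptedRequest): Promise<void> {
  return shouldSkip(request.resourceType())
    ? request.abort()
    : request.continue();
}

// Wiring with Puppeteer directly (not executed in this sketch):
//   await page.setRequestInterception(true);
//   page.on("request", (request) => { void routeRequest(request); });
```

Keeping the decision in a small pure function (`shouldSkip`) means the skip list can be checked in a unit test separately from any browser session.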