December 22, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

  • Outcome of crawler meeting
  • Outcome of Danhua's testing of the post-processor
  • Issues Review

Meeting Notes

  • The Cheerio crawler would be faster, but it doesn't render JavaScript
    • going to try a Cheerio crawler implementation to see which sites still work without JavaScript rendering
    • if they do, this may improve the speed of the crawler
  • PhantomJS is blocked by many websites, so it is not a good option
  • Puppeteer is likely as fast as we are going to get if we want to render the entire page with JavaScript
  • Looking into doing more crawls concurrently
    • If all the instances are crawling from one queue, we need to figure out how to handle the queue itself while links are being added
    • e.g. what defines the first instance's queue versus the second's, and how do we allocate the queued links?
  • We don't yet have a definite way of sharing the queue between instances
  • Alex has started testing the database
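
One possible answer to the queue-allocation question above is to skip per-instance queues entirely and give every worker the same frontier, handing each link to whichever worker is free next. The sketch below is only an illustration of that idea under stated assumptions, not the crawler's actual design: `SharedQueue`, `crawl`, and `fetchLinks` are hypothetical names, and everything lives in memory inside a single Node process. Sharing across separate processes or machines would need some external store, which these notes have not decided on.

```typescript
// Hypothetical sketch: a single frontier shared by several concurrent workers.
// fetchLinks stands in for whatever actually loads a page and extracts its links.
type FetchLinks = (url: string) => Promise<string[]>;

class SharedQueue {
  private pending: string[] = [];
  private seen = new Set<string>();

  // Queue a URL only if it has never been queued before (de-duplication).
  enqueue(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Hand out the oldest pending URL, or undefined if nothing is waiting.
  dequeue(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

// Crawl from the seeds with at most `workers` fetches in flight at once.
// Resolves with every URL that was handed to a worker.
function crawl(seeds: string[], fetchLinks: FetchLinks, workers: number): Promise<string[]> {
  const queue = new SharedQueue();
  for (const seed of seeds) queue.enqueue(seed);
  const visited: string[] = [];
  let active = 0;

  return new Promise((resolve) => {
    const pump = (): void => {
      // Fill every free worker slot from the shared queue.
      while (active < workers && queue.size > 0) {
        const url = queue.dequeue();
        if (url === undefined) break;
        active++;
        visited.push(url);
        fetchLinks(url)
          .then(
            (links) => { for (const link of links) queue.enqueue(link); },
            () => { /* a failed fetch is skipped in this sketch */ }
          )
          .then(() => { active--; pump(); });
      }
      // Done only when nothing is queued and no worker can add more links.
      if (active === 0 && queue.size === 0) resolve(visited);
    };
    pump();
  });
}
```

Because links are claimed at dequeue time and recorded in `seen` at enqueue time, no two workers receive the same URL, and the crawl finishes only once the queue is empty and no fetch is still running.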