December 22, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki
- Outcome of crawler meeting
- Outcome of Danhua's testing of the post-processor
- Issues Review
Meeting Notes
- A Cheerio crawler would be faster, but it doesn't render JavaScript
- Going to try a Cheerio implementation of the crawler to see which sites still work without JavaScript rendering
- If they do, this may improve the speed of the crawler
- PhantomJS is blocked by many websites, so it is not a good option
- Puppeteer is likely as fast as we are going to get if we want to render the entire page with JavaScript
- Looking into doing more crawls concurrently
- If all the instances are crawling from a single queue, we need to figure out how to handle the queue itself while links are being added to it
- e.g. What defines the first instance's queue versus the second's? And how do we allocate the queued links between instances?
- We don't yet have a definite way of sharing the queue
- Alex has started testing the database
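
To make the queue-sharing question above concrete, here is a minimal sketch of one possible approach, not a decision from the meeting: instead of giving each instance its own queue, every worker pulls from a single shared frontier, and a "seen" set keeps a link from being queued twice. All names here (`SharedQueue`, `crawlAll`, `fetchLinks`) are hypothetical, and `fetchLinks` stands in for whatever Puppeteer or Cheerio page fetch the crawler actually uses. It also assumes all workers run inside one Node process.

```typescript
// Hypothetical sketch: one shared frontier consumed by several crawl workers.
// None of these names come from the MediaCAT code base.

// Stand-in for the real page fetch: returns the links found on a page.
type FetchLinks = (url: string) => Promise<string[]>;

class SharedQueue {
  private pending: string[] = [];
  private seen = new Set<string>();

  // Enqueue a URL only the first time it is seen; returns true if it was added.
  add(url: string): boolean {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.pending.push(url);
    return true;
  }

  // Hand out the oldest pending URL, or undefined if nothing is pending.
  next(): string | undefined {
    return this.pending.shift();
  }

  get size(): number {
    return this.pending.length;
  }
}

async function crawlAll(
  seeds: string[],
  fetchLinks: FetchLinks,
  workers: number
): Promise<string[]> {
  const queue = new SharedQueue();
  seeds.forEach((s) => queue.add(s));
  const visited: string[] = [];
  let inFlight = 0; // pages currently being fetched by some worker

  async function worker(): Promise<void> {
    while (true) {
      const url = queue.next();
      if (url === undefined) {
        // Queue is empty: stop only if no other worker can still add links.
        if (inFlight === 0) return;
        await new Promise<void>((resolve) => setTimeout(resolve, 5));
        continue;
      }
      inFlight++;
      try {
        const links = await fetchLinks(url);
        visited.push(url);
        links.forEach((l) => queue.add(l)); // duplicates are dropped by add()
      } finally {
        inFlight--; // a fetch error propagates to the caller of crawlAll
      }
    }
  }

  // Start the workers; each one takes whatever URL is next, so there is
  // no per-instance queue to define or allocate.
  await Promise.all(Array.from({ length: workers }, () => worker()));
  return visited;
}
```

This works without locks only because Node runs the workers on a single thread, so `next()` and `add()` never interleave. If the instances were separate processes or machines, the pending list and the seen set would have to live in some shared store instead, which this sketch does not attempt to show.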