Cheerio is faster than Puppeteer, but it does not render JavaScript and is visiting unexpected URLs
The Apify SDK is what is currently used with Puppeteer pools. Using Puppeteer directly could be an option, but we anticipate it would have the same performance issues. Since rendering is slow, we can look into rendering only selected portions of the JavaScript
The MediaCAT crawl with Puppeteer is to be restarted with changed max and min concurrency settings to see if this improves performance
Troubleshooting with Cheerio to be continued
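
The concurrency change mentioned above could be prepared as a small set of candidate settings to compare across restarts. The following is a minimal sketch, not the project's code: `buildConcurrencyOptions` is a hypothetical helper, the numeric values are illustrative only, and it assumes the crawler in use accepts `minConcurrency` and `maxConcurrency` options (as the Apify SDK crawlers do).

```javascript
// Hypothetical helper: builds the concurrency part of a crawler options object.
// Assumes the crawler accepts `minConcurrency` and `maxConcurrency`.
function buildConcurrencyOptions(min, max) {
  if (!Number.isInteger(min) || !Number.isInteger(max)) {
    throw new TypeError('min and max concurrency must be integers');
  }
  if (min < 1 || max < min) {
    throw new RangeError('expected 1 <= min <= max');
  }
  return { minConcurrency: min, maxConcurrency: max };
}

// Candidate settings to try on successive restarts (illustrative values only).
const candidates = [
  buildConcurrencyOptions(1, 5),
  buildConcurrencyOptions(5, 20),
];
```

Each candidate object would be merged into the crawler's existing options, so that only the concurrency bounds change between runs and any difference in crawl speed can be attributed to them.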
Metascraper crawl for dates
The date retrieved is pushed to the database
Next step is creating a sample set of article links to test what metascraper is able to retrieve
Before this, we are focusing on post-processor and crawler
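
The planned sample-set test could be summarised with a small harness like the one below. This is a sketch under assumptions: `measureDateCoverage` is a hypothetical name, and `extractDate` stands in for whatever wrapper the project writes around metascraper; it is assumed to be an async function that takes a URL and resolves to a date string, or to `null` when no date is found.

```javascript
// Runs a date extractor over a sample of article URLs and summarises the results.
// `extractDate` is a hypothetical async function: (url) => date string or null.
async function measureDateCoverage(urls, extractDate) {
  const unique = [...new Set(urls)]; // ignore duplicate links in the sample
  const dates = {};
  const missing = [];
  for (const url of unique) {
    let date = null;
    try {
      date = await extractDate(url);
    } catch (err) {
      date = null; // treat extractor failures as "no date retrieved"
    }
    if (date) {
      dates[url] = date;
    } else {
      missing.push(url);
    }
  }
  return {
    total: unique.length,
    withDate: Object.keys(dates).length,
    dates,
    missing,
  };
}
```

The `missing` list is the useful part for follow-up: it shows which article links need another date source or manual inspection.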
Some domains return only one or a few links
These counts are understood to be links crawled rather than links discovered, so the low numbers are likely due to our large scope
Still being investigated
MediaCAT domain crawler
Revision completed: the single large JSON has been removed and smaller JSONs are created instead
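
The notes do not say how the output is divided, so the following is only a sketch of one way to do it, assuming fixed-size batches of records. `splitIntoJsonBatches` and `batchSize` are hypothetical names, not taken from the MediaCAT code.

```javascript
// Splits an array of records into JSON strings of at most `batchSize` records each.
// Hypothetical helper; the real crawler may split by domain or by another key instead.
function splitIntoJsonBatches(records, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(JSON.stringify(records.slice(i, i + batchSize)));
  }
  return batches;
}
```

Each returned string can then be written to its own file, which means a downstream step can load and process one small file at a time instead of parsing a single large document.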
Post-processor framework
Benchmarking wasn't completed with the whole dataset, so it is still unclear how long post-processing will take
New repository with API-based code completed and merged