January 05, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

Crawler speed performance issues
- Cheerio is faster than Puppeteer, but it does not render JavaScript and is visiting unexpected URLs
- The Apify SDK is currently used with Puppeteer pools. Using Puppeteer directly could be an option, but we anticipate it would have the same performance issues. Since rendering is slow, we can look into rendering only selected portions of the JavaScript
- The MediaCAT crawl with Puppeteer is to be restarted with changed max and min concurrency settings to see if this improves performance
- Troubleshooting with Cheerio to be continued
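One possible way to render only part of a page's JavaScript is to decide, per request, which resources the browser is allowed to load. The sketch below is an illustration and not the crawler's actual code: the function name `shouldLoad`, the skipped resource types, and the host allow-list are all assumptions. Only the resource type strings and the Puppeteer calls named in the closing comment come from Puppeteer itself.

```typescript
// Resource types skipped outright. The strings match values returned by
// Puppeteer's request.resourceType(); which ones to skip is an assumption.
const SKIPPED_TYPES = new Set(["image", "media", "font", "stylesheet"]);

// Hypothetical helper: returns true when a request should be allowed to load.
// Scripts load only from hosts on the (assumed) allow-list or their subdomains.
function shouldLoad(
  resourceType: string,
  requestUrl: string,
  allowedScriptHosts: string[],
): boolean {
  if (SKIPPED_TYPES.has(resourceType)) return false;
  if (resourceType !== "script") return true;

  let host: string;
  try {
    host = new URL(requestUrl).hostname;
  } catch {
    return false; // unparseable script URL: skip it
  }
  return allowedScriptHosts.some(
    (allowed) => host === allowed || host.endsWith("." + allowed),
  );
}

// Wiring into Puppeteer (not executed here; `page` is a Puppeteer Page):
//   await page.setRequestInterception(true);
//   page.on("request", (req) =>
//     shouldLoad(req.resourceType(), req.url(), ["example.com"])
//       ? req.continue()
//       : req.abort(),
//   );
```

Keeping the decision in a pure function means the allow-list can be tuned and tested without launching a browser. Whether this actually speeds up the crawl would still need to be measured.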
Metascraper crawl for dates
- The date retrieved is pushed to the database
- The next step is creating a sample set of article links to test what metascraper is able to retrieve
- Before this, we are focusing on the post-processor and crawler
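Once a sample set of article links exists, a small tally could show how often a date is actually retrieved. This is a minimal sketch under assumptions: the `ScrapeResult` shape and the function name `summarizeDateCoverage` are hypothetical, and the results are assumed to have been collected beforehand, for example from metascraper's output.

```typescript
// Hypothetical record for one sampled article link.
interface ScrapeResult {
  url: string;
  date?: string | null; // whatever the date extractor returned, if anything
}

interface DateCoverage {
  total: number;
  withDate: number;
  missing: string[]; // URLs for which no parseable date was found
}

// Counts how many sampled links produced a date that Date.parse accepts.
function summarizeDateCoverage(results: ScrapeResult[]): DateCoverage {
  const missing: string[] = [];
  for (const result of results) {
    const parsed = result.date ? Date.parse(result.date) : NaN;
    if (Number.isNaN(parsed)) missing.push(result.url);
  }
  return {
    total: results.length,
    withDate: results.length - missing.length,
    missing,
  };
}
```

The `missing` list gives a starting point for checking by hand which kinds of pages do not yield a date.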
Some domains return only one or a few links
- These counts are understood to be links crawled rather than links discovered, so the low numbers are likely due to our large scope
- Still being investigated
MediaCAT domain crawler
- Revision completed: the single large JSON is replaced by several smaller JSONs
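The idea of writing several smaller JSONs instead of one large one can be sketched as follows. This is not the domain crawler's actual code: the helper names `chunk` and `toJsonFiles`, the file naming scheme, and the fixed chunk size are assumptions for illustration.

```typescript
// Splits a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical description of one output file, held in memory.
interface OutputFile {
  name: string;
  body: string;
}

// Serializes each group separately, so no single JSON string holds everything.
function toJsonFiles<T>(items: T[], size: number, prefix: string): OutputFile[] {
  return chunk(items, size).map((part, index) => ({
    name: `${prefix}-${index}.json`,
    body: JSON.stringify(part),
  }));
}
```

Writing each `body` to disk is left to the caller. In a real crawl the records would more likely be flushed as they are produced rather than collected in memory first.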
Post-processor framework
- Benchmarking was not completed with the whole dataset, so it is still unclear how long post-processing will take
New repository with API-based code completed and merged