December 15, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki
Meeting Notes
- Add mini-processor
- splits the one giant CSV into smaller CSVs
- completed, to be reviewed by Amy
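A minimal sketch of this kind of CSV splitting (the file names, chunk size, and `csv` layout here are illustrative assumptions, not the actual mini-processor code):

```python
import csv
import os

def _write_chunk(out_dir, index, header, rows):
    """Write one smaller CSV, repeating the header row."""
    path = os.path.join(out_dir, f"part_{index:04d}.csv")
    with open(path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
    return path

def split_csv(src_path, out_dir, rows_per_file=1000):
    """Split one large CSV into smaller CSVs of at most rows_per_file rows each."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, index = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(out_dir, index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # flush the final partial chunk
            written.append(_write_chunk(out_dir, index, header, chunk))
    return written
```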
- Crawler performance - currently too slow
- crawls about 14,000 links a day, which is slow compared to Nat's old crawler
- to be investigated by benchmarking several links to see where operating time can be improved
- the Apify Puppeteer crawler allocates a fixed amount of RAM to its concurrent crawls, and this allocation can be changed
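One way to do the benchmarking above, sketched with a stand-in `fetch` function (the real crawl call would replace it), is to time each link and extrapolate a daily rate:

```python
import time

def benchmark(fetch, urls):
    """Time fetch() on each URL and report per-link and projected daily throughput."""
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    avg = sum(timings) / len(timings)
    return {
        "avg_seconds_per_link": avg,
        "projected_links_per_day": 86400 / avg if avg > 0 else float("inf"),
        "timings": timings,
    }
```

Running this over a handful of domains should show whether the slowness comes from a few outlier links or a uniformly slow crawl.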
- Domains that are only returning 1 link - some block is being encountered
- a paywall? This seems unlikely, though: 972mag is one example with this issue, and it has no paywall
- need to investigate to what extent this is caused by the domain being examined versus by Mediacat's own operation
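A quick diagnostic for this, assuming the crawl output is a CSV with a `url` column (the column name and layout are assumptions), is to count links per domain and flag the domains that returned only one:

```python
import csv
from collections import Counter
from urllib.parse import urlparse

def single_link_domains(csv_path, url_column="url"):
    """Return the set of domains that appear exactly once in the crawl output."""
    counts = Counter()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[urlparse(row[url_column]).netloc] += 1
    return {domain for domain, n in counts.items() if n == 1}
```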
- Mediacat Domain crawler
- need to confirm that the crawler operates asynchronously
- if it is stuck on something, this could point to blocking (synchronous) behaviour
- a large JSON dictionary is still being built while the domain crawler crawls, which may be consuming memory
- to be removed, to see whether this improves performance
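If the dictionary turns out to be needed after all, one memory-friendlier alternative (a sketch, not the current crawler code) is to append each record as a JSON line during the crawl and only rebuild the full dictionary afterwards:

```python
import json

def append_record(path, key, value):
    """Append one crawl record as a JSON line instead of growing an in-memory dict."""
    with open(path, "a") as f:
        f.write(json.dumps({key: value}) + "\n")

def load_records(path):
    """Rebuild the full dictionary later, only when it is actually needed."""
    merged = {}
    with open(path) as f:
        for line in f:
            merged.update(json.loads(line))
    return merged
```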
- Integrate the metascraper crawl to operate on the Puppeteer crawler output
- work to be continued on this
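metascraper itself is a Node.js library, so the following is only a language-agnostic sketch of the kind of extraction step involved: pulling basic metadata (title, named `<meta>` tags) out of the HTML the Puppeteer crawler saves:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.meta[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_metadata(html):
    """Return a small metadata dict for one crawled page."""
    parser = MetaExtractor()
    parser.feed(html)
    return {"title": parser.title.strip(), "meta": parser.meta}
```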
- Branch clean-up should be conducted as soon as possible
- as much as possible, work should be based on the master branch
- create smaller-scope issues and branches to keep the project maintainable
- Accepting a CSV file from parser to populate initial queue
- tested and completed, closed issue
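Populating the initial queue from the parser's CSV might look like the following sketch (the `url` column name and the deque-based FIFO queue are assumptions for illustration):

```python
import csv
from collections import deque

def load_queue(csv_path, url_column="url"):
    """Read URLs from the parser's CSV and seed a FIFO crawl queue, skipping duplicates."""
    queue, seen = deque(), set()
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            url = row[url_column].strip()
            if url and url not in seen:
                seen.add(url)
                queue.append(url)
    return queue
```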
- Create post-processor framework
- currently running in a virtual machine; waiting on output to verify it behaves as intended
- need to track the time taken for the amount of data processed, for benchmarking purposes
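For that benchmarking item, a small wrapper (a sketch; the actual post-processor interface is assumed) can record items processed against wall-clock time:

```python
import time

def run_with_throughput(process, items):
    """Run process() over items and report count, elapsed seconds, and items/second."""
    start = time.perf_counter()
    for item in items:
        process(item)
    elapsed = time.perf_counter() - start
    return {
        "items": len(items),
        "elapsed_seconds": elapsed,
        "items_per_second": len(items) / elapsed if elapsed > 0 else float("inf"),
    }
```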