December 08, 2020

  • Crawler stopped - can we refactor to address the issue?
  • Issues Review

Meeting notes

  • Crawler stopped
    • whatever error occurred was not registered in debug.log, so error logging needs to be addressed
    • appears to be a memory issue: too many links may be crawled at once for the available memory to handle
    • Puppeteer makes async calls, so multiple requests run concurrently
    • need to understand the error before concluding that there is not enough memory
    • not short on storage, short on memory
    • when it fails, is it possible to pick up where it left off? The queue is maintained in a "pending" folder
      • if the crawler is re-started, it should start from this "pending" folder again (a restart-and-resume sketch follows this section)
    • relevant stackoverflow link
    • detecting memory leaks with Puppeteer
    • meeting with system administrator set up
    • implementing a database will also allow us to track where we are in the scope crawl
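A minimal sketch of the restart-and-resume behaviour discussed above, with a cap on concurrent Puppeteer pages to bound memory use. Only the "pending" folder name comes from the notes; the one-URL-per-file layout, the limit of 5 pages, and the error handling are placeholder assumptions, not the crawler's actual code:

```js
// Sketch only: assumes each file in pending/ holds one queued URL.
const fs = require('fs/promises');
const path = require('path');
const puppeteer = require('puppeteer');

const PENDING_DIR = 'pending'; // queue folder mentioned in the notes
const MAX_CONCURRENT = 5;      // assumed cap on open pages

async function loadPendingUrls() {
  const files = await fs.readdir(PENDING_DIR);
  return Promise.all(
    files.map((f) => fs.readFile(path.join(PENDING_DIR, f), 'utf8'))
  );
}

async function crawl() {
  const browser = await puppeteer.launch();
  const queue = (await loadPendingUrls()).map((u) => u.trim());

  // Worker pool: never more than MAX_CONCURRENT pages open, so a long
  // queue cannot exhaust memory the way unbounded async calls can.
  const workers = Array.from({ length: MAX_CONCURRENT }, async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        // ... extract and persist results here ...
      } catch (err) {
        // surface the failure; the real crawler should also write it
        // to debug.log so stoppages are diagnosable
        console.error(`failed: ${url}`, err);
      } finally {
        await page.close(); // release page memory before the next URL
      }
    }
  });

  await Promise.all(workers);
  await browser.close();
}

crawl();
```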
  • Twitter data quotation-marks issue resolved with the implementation of the mini-processor (see the escaping sketch below)
    • add a mini-processor issue, to be reviewed by Amy
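A sketch of the kind of quote handling a mini-processor might perform, assuming the problem was unescaped double quotes in tweet text breaking CSV rows; the field names are hypothetical:

```js
// RFC 4180 escaping: wrap each field in double quotes and double any
// embedded quotes. tweet.id / tweet.text / tweet.date are assumed names.
function toCsvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

function tweetToCsvRow(tweet) {
  return [tweet.id, tweet.text, tweet.date].map(toCsvField).join(',');
}

// A tweet like  She said "hello"  becomes  "She said ""hello"""
console.log(tweetToCsvRow({
  id: '123',
  text: 'She said "hello"',
  date: '2020-12-08',
}));
```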
  • Integrate the Metascraper crawl to operate on the Puppeteer crawler output (see the sketch below)
    • implement tracking of whether a date was retrieved
    • Jacqueline's new database will be used by Metascraper
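A sketch of running Metascraper over HTML the Puppeteer crawler has already fetched, recording whether a date was retrieved; the rule set and record shape are assumptions, not the project's schema:

```js
const metascraper = require('metascraper')([
  require('metascraper-date')(),
  require('metascraper-title')(),
]);

async function extractMetadata(url, html) {
  const metadata = await metascraper({ url, html });
  return {
    url,
    title: metadata.title,
    date: metadata.date,
    dateRetrieved: Boolean(metadata.date), // track date success, per the notes
  };
}

// usage, with html previously saved by the Puppeteer crawler:
// const record = await extractMetadata('https://example.com/story', html);
```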
  • Mediacat domain crawler
    • working on this week
  • Accepting a .csv file from the parser to populate the initial queue (see the seeding sketch below)
    • bugs being resolved
    • this does not need to be resolved before restarting the crawl
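A sketch of seeding the initial queue from a parser-produced .csv, assuming one URL per row in the first column and the one-file-per-URL "pending" layout from the crawler discussion; the file name scope.csv and the naive comma split are placeholders:

```js
const fs = require('fs');
const path = require('path');

function seedQueueFromCsv(csvPath, pendingDir = 'pending') {
  fs.mkdirSync(pendingDir, { recursive: true });
  fs.readFileSync(csvPath, 'utf8')
    .split('\n')
    .map((row) => row.split(',')[0].trim())
    .filter((url) => url.startsWith('http'))
    .forEach((url, i) => {
      // one file per queued URL so a restart can resume from pending/
      fs.writeFileSync(path.join(pendingDir, `${i}.txt`), url);
    });
}

seedQueueFromCsv('scope.csv');
```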
  • Create post-processor framework
    • this is mostly complete; it needs to be revised to accept the smaller individual CSVs (see the sketch below)
    • Danhua will then test this
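A sketch of a post-processor pass over a directory of smaller individual CSVs rather than one large file, assuming each file carries a header row; the directory name crawl-output is hypothetical:

```js
const fs = require('fs');
const path = require('path');

function postProcess(csvDir) {
  const merged = [];
  for (const file of fs.readdirSync(csvDir)) {
    if (!file.endsWith('.csv')) continue;
    const lines = fs
      .readFileSync(path.join(csvDir, file), 'utf8')
      .split('\n')
      .filter((line) => line.trim().length > 0);
    merged.push(...lines.slice(1)); // drop each file's header row
  }
  return merged;
}

console.log(postProcess('crawl-output').length, 'rows merged');
```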
  • New repository with API-based code
    • input has been reshaped, but will still need some work before it can be fed into the mini-processor