December 08, 2020

  • Crawler stopped - can we refactor to address the issue?
  • Issues Review

Meeting notes

  • Crawler stopped
    • whatever error occurred was not registered in debug.log, so error logging needs to be addressed
    • appears to be a memory issue: too many links may be crawled at once for the available memory to handle
    • Puppeteer makes async calls, so multiple requests run concurrently
    • need to understand the error before concluding that there is not enough memory
    • not short on storage, short on memory
    • when it fails, is it possible to pick up where it left off? The queue is maintained in a "pending" folder
      • if the crawler is re-started, it should start from this "pending" folder again (a restart-and-resume sketch follows this section)
    • relevant stackoverflow link
    • detecting memory leaks with Puppeteer
    • meeting with system administrator set up
    • implementing a database will also allow us to track where we are in the scope crawl
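A minimal sketch of the restart-and-resume behaviour discussed above, with a cap on concurrent Puppeteer pages to bound memory use. Only the "pending" folder name comes from the notes; the one-URL-per-file layout, the limit of 5 pages, and the error handling are placeholder assumptions, not the crawler's actual code:

```js
// Sketch only: assumes each file in pending/ holds one queued URL.
const fs = require('fs/promises');
const path = require('path');
const puppeteer = require('puppeteer');

const PENDING_DIR = 'pending'; // queue folder mentioned in the notes
const MAX_CONCURRENT = 5;      // assumed cap on open pages

async function loadPendingUrls() {
  const files = await fs.readdir(PENDING_DIR);
  return Promise.all(
    files.map((f) => fs.readFile(path.join(PENDING_DIR, f), 'utf8'))
  );
}

async function crawl() {
  const browser = await puppeteer.launch();
  const queue = (await loadPendingUrls()).map((u) => u.trim());

  // Worker pool: never more than MAX_CONCURRENT pages open, so a long
  // queue cannot exhaust memory the way unbounded async calls can.
  const workers = Array.from({ length: MAX_CONCURRENT }, async () => {
    while (queue.length > 0) {
      const url = queue.shift();
      const page = await browser.newPage();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        // ... extract and persist results here ...
      } catch (err) {
        // surface the failure; the real crawler should also write it
        // to debug.log so stoppages are diagnosable
        console.error(`failed: ${url}`, err);
      } finally {
        await page.close(); // release page memory before the next URL
      }
    }
  });

  await Promise.all(workers);
  await browser.close();
}

crawl();
```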
  • Twitter data quotation-marks issue resolved with the implementation of the mini-processor (see the escaping sketch below)
    • add a mini-processor issue, to be reviewed by Amy
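A sketch of the kind of quote handling a mini-processor might perform, assuming the problem was unescaped double quotes in tweet text breaking CSV rows; the field names are hypothetical:

```js
// RFC 4180 escaping: wrap each field in double quotes and double any
// embedded quotes. tweet.id / tweet.text / tweet.date are assumed names.
function toCsvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"';
}

function tweetToCsvRow(tweet) {
  return [tweet.id, tweet.text, tweet.date].map(toCsvField).join(',');
}

// A tweet like  She said "hello"  becomes  "She said ""hello"""
console.log(tweetToCsvRow({
  id: '123',
  text: 'She said "hello"',
  date: '2020-12-08',
}));
```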
  • Integrate the Metascraper crawl to operate on the Puppeteer crawler output (see the sketch below)
    • implement tracking of whether a date was retrieved
    • Jacqueline's new database will be used by Metascraper
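A sketch of running Metascraper over HTML the Puppeteer crawler has already fetched, recording whether a date was retrieved; the rule set and record shape are assumptions, not the project's schema:

```js
const metascraper = require('metascraper')([
  require('metascraper-date')(),
  require('metascraper-title')(),
]);

async function extractMetadata(url, html) {
  const metadata = await metascraper({ url, html });
  return {
    url,
    title: metadata.title,
    date: metadata.date,
    dateRetrieved: Boolean(metadata.date), // track date success, per the notes
  };
}

// usage, with html previously saved by the Puppeteer crawler:
// const record = await extractMetadata('https://example.com/story', html);
```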
  • Mediacat domain crawler
    • working on this week
  • Accepting a .csv file from the parser to populate the initial queue (see the seeding sketch below)
    • bugs being resolved
    • this does not need to be resolved before restarting the crawl
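A sketch of seeding the initial queue from a parser-produced .csv, assuming one URL per row in the first column and the one-file-per-URL "pending" layout from the crawler discussion; the file name scope.csv and the naive comma split are placeholders:

```js
const fs = require('fs');
const path = require('path');

function seedQueueFromCsv(csvPath, pendingDir = 'pending') {
  fs.mkdirSync(pendingDir, { recursive: true });
  fs.readFileSync(csvPath, 'utf8')
    .split('\n')
    .map((row) => row.split(',')[0].trim())
    .filter((url) => url.startsWith('http'))
    .forEach((url, i) => {
      // one file per queued URL so a restart can resume from pending/
      fs.writeFileSync(path.join(pendingDir, `${i}.txt`), url);
    });
}

seedQueueFromCsv('scope.csv');
```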
  • Create post-processor framework
    • this is mostly complete; it needs to be revised to accept the smaller individual CSVs (see the sketch below)
    • Danhua will then test this
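A sketch of a post-processor pass over a directory of smaller individual CSVs rather than one large file, assuming each file carries a header row; the directory name crawl-output is hypothetical:

```js
const fs = require('fs');
const path = require('path');

function postProcess(csvDir) {
  const merged = [];
  for (const file of fs.readdirSync(csvDir)) {
    if (!file.endsWith('.csv')) continue;
    const lines = fs
      .readFileSync(path.join(csvDir, file), 'utf8')
      .split('\n')
      .filter((line) => line.trim().length > 0);
    merged.push(...lines.slice(1)); // drop each file's header row
  }
  return merged;
}

console.log(postProcess('crawl-output').length, 'rows merged');
```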
  • New repository with API-based code
    • input has been reshaped, but will still need some work before it can be fed into the mini-processor