December 15, 2020 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

  • Add mini-processor
    • splits the output from one giant CSV into smaller CSVs
    • completed, to be reviewed by Amy
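A minimal Python sketch of the splitting step discussed above. Function names, the row-count partitioning, and the `part_NNNN.csv` naming are all hypothetical; the real mini-processor may partition differently (e.g. by domain):

```python
import csv
from pathlib import Path

def _write_chunk(out_dir, index, header, rows):
    # Each chunk repeats the original header so it is a valid CSV on its own.
    with open(out_dir / f"part_{index:04d}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def split_csv(src_path, out_dir, rows_per_file=1000):
    # Hypothetical sketch: stream the giant CSV once, emitting a new file
    # every rows_per_file rows. Returns the number of files written.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files_written = 0
    with open(src_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                _write_chunk(out_dir, files_written, header, chunk)
                files_written += 1
                chunk = []
        if chunk:  # flush the final, possibly short, chunk
            _write_chunk(out_dir, files_written, header, chunk)
            files_written += 1
    return files_written
```

Streaming row by row keeps memory flat regardless of how large the source CSV is.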
  • Crawler performance - currently too slow
    • crawls roughly 14,000 links a day, which is slow compared to Nat's old crawler
    • to be investigated by benchmarking a sample of links and identifying where crawl time can be reduced
    • the Apify Puppeteer crawler allocates a set amount of RAM for its concurrent crawls, and this allocation can be configured
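For the benchmarking step, a small timing harness like the following (a Python sketch; `fetch` stands in for whatever function performs one crawl step) would give per-link latency and a projected daily rate to compare against the current ~14,000 links/day, i.e. roughly one link every 6 seconds:

```python
import time
from statistics import mean, median

def benchmark_fetch(fetch, urls):
    # Time fetch(url) over a sample of URLs. `fetch` is a placeholder for
    # a single crawl step; swap in the real crawler call when benchmarking.
    timings = []
    for url in urls:
        start = time.perf_counter()
        fetch(url)
        timings.append(time.perf_counter() - start)
    avg = mean(timings)
    return {
        "mean_s": avg,
        "median_s": median(timings),
        # Projected sequential throughput; concurrency would scale this up.
        "projected_links_per_day": int(86400 / avg) if avg > 0 else None,
    }
```

Comparing `mean_s` across link samples from different domains should also help isolate whether slowness is domain-specific or systemic.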
  • Domains that are only returning 1 link - some block is being encountered
    • possibly a paywall, although this seems unlikely: 972mag is one domain with this issue, and it has no paywall
    • need to investigate to what extent the issue is caused by the domain being crawled versus by MediaCAT's own operation
  • MediaCAT domain crawler
    • need to confirm that the crawler operates asynchronously
    • if it stalls on a single request, blocking (synchronous) code could be the cause
    • a large JSON dictionary is still being built while the domain crawler runs, which may be consuming memory
      • remove it and check whether this improves performance
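The sync-vs-async distinction above can be illustrated with a short Python `asyncio` analogy (the actual crawler is Node/Apify, but the pattern is the same): if N fetches take roughly N times one fetch's latency, something is running sequentially.

```python
import asyncio

async def fake_fetch(url, delay=0.05):
    # Stand-in for a page fetch; real crawl I/O would await the network.
    await asyncio.sleep(delay)
    return url

async def crawl_sequential(urls):
    # Awaiting each fetch before starting the next: total time ~ N * delay.
    return [await fake_fetch(u) for u in urls]

async def crawl_concurrent(urls):
    # All fetches in flight at once: total time ~ delay, regardless of N.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))
```

Timing both versions over the same URL list makes a blocked crawler obvious: the sequential variant's wall-clock time grows linearly with the number of links, the concurrent one's does not.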
  • Integrate metascraper crawl to operate on the Puppeteer crawler output
    • work to be continued on this
  • Branch clean-up should be conducted as soon as possible
    • work from the master branch as much as possible
    • create smaller-scope issues and branches to keep the project maintainable
  • Accepting a CSV file from parser to populate initial queue
    • tested and completed, closed issue
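For reference, the queue-population step can be sketched as follows in Python (the `url` column name and de-duplication behaviour are assumptions, not the reviewed implementation):

```python
import csv
from collections import deque

def load_initial_queue(csv_path, url_column="url"):
    # Read the parser's CSV output and build the crawler's starting queue.
    # De-duplicates while preserving the order URLs appear in the file.
    queue, seen = deque(), set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(url_column) or "").strip()
            if url and url not in seen:
                seen.add(url)
                queue.append(url)
    return queue
```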
  • Create post-processor framework
    • currently running in a virtual machine; waiting on its output to verify it behaves as intended
    • track the time taken relative to the amount of data processed, for benchmarking purposes
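A small tracker like this Python sketch (names hypothetical) would capture the time-per-data-volume numbers the benchmarking item calls for:

```python
import time

class ThroughputTracker:
    # Records how long a processing run takes relative to items processed,
    # so runs of different sizes can be compared on items/second.
    def __init__(self):
        self._start = None
        self.items = 0
        self.elapsed_s = 0.0

    def start(self):
        self._start = time.perf_counter()

    def record(self, n_items=1):
        self.items += n_items

    def stop(self):
        self.elapsed_s = time.perf_counter() - self._start
        return self.elapsed_s

    def items_per_second(self):
        return self.items / self.elapsed_s if self.elapsed_s > 0 else 0.0
```

Logging `items` and `elapsed_s` at the end of each post-processor run on the VM gives a consistent benchmark as data volume grows.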