February 04, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting Notes

  • Post-processor framework
    • The connection running the post-processor dropped with a "socket broke connection failed" error
    • How could we make the post-processor resume instead of restarting from the beginning if such an error arises?
    • 1,000,000/5,000,000 entries completed
    • Could approach this by dividing the content and multithreading
    • Need the full dataset to establish relationships between items, but can the dataset be split up for actual post-processing?
    • Create dictionary of user handles, and track their completion in this dictionary
      • This dictionary would have to be written to a file so that it is still available if the post-processor fails
      • Would have to find a way to share the dictionary between all processes
  • Upgrade sudo for all instances on Compute Canada
    • On hold for the instance currently in use until its run is done; the upgrade should then be scheduled in
  • Crawler performance
    • Sites asking for cookie permissions need to be handled
      • Possible option: the "I don't care about cookies" browser extension
    • Raiyan configured selective downloading so that text is available without having to wait for images and other media
    • Need to find a way to handle popups
    • Using APify Puppeteer rather than Puppeteer alone makes it difficult to add extensions
    • The crawl is asynchronous, but because we wait for each request to resolve, it is effectively synchronous
      • Writing asynchronously requires processing after crawl, not during
    • Puppeteer (non-APify) needs to be revisited, and the pop-up problem still needs to be solved
    • Puppeteer developers have implemented a "stealth" mode; this needs to be looked into and may help with cookie handling
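
On the post-processor restart question above, the idea of a completion dictionary written to a file could look roughly like the sketch below. It is a minimal illustration, not the actual post-processor: `run_post_processor`, `process_handle`, `checkpoint_path`, and `save_every` are hypothetical names, and it assumes each user handle can be processed independently once the full dataset has been loaded.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path):
    """Return the saved {handle: True} dictionary, or an empty one on first run."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text(encoding="utf-8"))
    return {}


def save_checkpoint(path, completed):
    """Write to a temp file, then swap it into place so a crash cannot leave a half-written file."""
    p = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=p.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(completed, f)
    os.replace(tmp_name, p)


def run_post_processor(handles, process_handle, checkpoint_path, save_every=100):
    """Process each handle once, skipping any already marked complete (hypothetical driver)."""
    completed = load_checkpoint(checkpoint_path)
    since_save = 0
    try:
        for handle in handles:
            if completed.get(handle):
                continue  # finished in an earlier run
            process_handle(handle)
            completed[handle] = True
            since_save += 1
            if since_save >= save_every:
                save_checkpoint(checkpoint_path, completed)
                since_save = 0
    finally:
        # Save progress even when an error (e.g. a dropped connection) interrupts the loop.
        save_checkpoint(checkpoint_path, completed)
    return completed
```

Rerunning `run_post_processor` with the same `checkpoint_path` after a failure would skip the handles already recorded. This sketch has a single writer; if the work is split across processes, one option is to have a parent process collect finished handles from the workers and be the only one that writes the file.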
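
On the crawler point that waiting for each request to resolve makes an asynchronous crawl behave synchronously, the difference can be illustrated with a small sketch. The crawler itself uses Puppeteer; the example below uses Python's `asyncio` only to show the pattern, and `fetch_page` is a hypothetical stand-in that sleeps instead of loading a real page.

```python
import asyncio
import time


async def fetch_page(url, delay=0.05):
    """Stand-in for a page fetch; sleeps instead of touching the network."""
    await asyncio.sleep(delay)
    return f"text of {url}"


async def crawl_one_at_a_time(urls):
    """Awaits each fetch before starting the next, so the fetches run back to back."""
    results = []
    for url in urls:
        results.append(await fetch_page(url))
    return results


async def crawl_concurrently(urls):
    """Starts every fetch first, then waits for all of them together."""
    return await asyncio.gather(*(fetch_page(url) for url in urls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Both functions return the same results in the same order, but `crawl_one_at_a_time` takes roughly the sum of the delays while `crawl_concurrently` takes roughly the longest single delay. In the concurrent form the results only become available once all fetches finish, which matches the note that writing would then happen after the crawl rather than during it.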