February 25, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

  • Crawler performance
    • Some sites that return only a single hit in a full crawl return more hits when tested alone (no other domain in the queue)
    • Raiyan suggests exploring batching
      • The single hits may occur because the queue prioritizes certain domains instead of spreading work equally across all our domains
      • To be explored: how to create batches, how to automate detecting when a batch has finished crawling, and when to start crawling the next batch
    • Jacqueline retrieved the database
    • One site being debugged had deliberately hidden popups (half the time, the button to close the popup cannot be found)
    • Queue considerations
      • Using Puppeteer without Apify would give us more control over the queueing
  • Create post-processor framework
    • Catch errors in post-processing and write the data processed so far to multiple files so that processing can be picked up again after a break
    • Picking back up after a break is still in progress, but saving the files already works
  • Discussion of output format
    • The one-to-one linkages as proposed on the Refactor-titled sheets may require a lot of programming changes
    • Issues to address
      • There is no entry for a Twitter account (only for tweets), so hits on Twitter user mentions are replicated across many tweets
      • Articles discovered through the tweet crawl do not currently retain their article-specific URL, so these article mentions are lost or generalized to the domain
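
The batching idea raised under Crawler performance is still to be explored, so the following is only a minimal sketch of one possible approach, not the crawler's actual code. It assumes the domain list fits in memory and that each domain already has a list of pending URLs. The names `make_batches` and `interleave` are hypothetical. The first function splits the domains into fixed-size batches; the second orders the URLs of one batch round-robin so that no single domain dominates the queue.

```python
from collections import deque
from itertools import islice
from typing import Iterable, Iterator


def make_batches(domains: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size domains (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(domains)
    # islice takes the next batch_size items; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch


def interleave(urls_by_domain: dict[str, list[str]]) -> list[str]:
    """Order URLs round-robin so each domain in a batch gets equal turns."""
    # One queue per domain, skipping domains with nothing pending.
    queues = deque(deque(urls) for urls in urls_by_domain.values() if urls)
    ordered: list[str] = []
    while queues:
        current = queues.popleft()
        ordered.append(current.popleft())
        if current:
            # The domain still has URLs left: send it to the back of the line.
            queues.append(current)
    return ordered
```

A batch could be treated as finished once every URL returned by `interleave` has been requested, at which point the next batch from `make_batches` would be started. How this maps onto the Apify or Puppeteer queue is left open here.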