February 25, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Meeting notes

Crawler performance
- Some sites with single hits have more hits when tested alone (no other domain in queue)
- Raiyan suggests exploring batching
  - The single hit may be because the queue prioritizes certain domains instead of spreading requests evenly across all our domains
  - To be explored: how to create batches, how to automatically detect when a batch has finished crawling, and when to start crawling the next batch
- Jacqueline retrieved the database
- One site being debugged had deliberately hidden popups (about half the time, the button to close the popup cannot be found)
- Queue considerations
  - Using Puppeteer without Apify would give us more control over the queueing
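
The batching idea above could be sketched roughly as follows. This is only an illustration in Python, not the crawler's actual code: `make_batches`, `batch_is_done`, and the `pending_counts` mapping (domain to number of URLs still queued) are hypothetical names, and how pending counts would be obtained from the real queue was not discussed.

```python
from itertools import islice


def make_batches(domains, batch_size):
    """Split the domain list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(domains)
    batches = []
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        batches.append(batch)
    return batches


def batch_is_done(pending_counts, batch):
    """Treat a batch as crawled once no domain in it has URLs left in the queue.

    pending_counts is an assumed mapping of domain -> queued URL count;
    a domain missing from the mapping is treated as having nothing pending.
    """
    return all(pending_counts.get(domain, 0) == 0 for domain in batch)
```

With a check like `batch_is_done`, the next batch would only start once the current one is empty, so no single domain could crowd out the others for long.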
Create post-processor framework
- Catch errors in post-processing and write data that has been processed to multiple files so they can be picked up after a break
- Resuming after a break is still in progress, but the files are being saved successfully
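
One way the save-and-resume approach could look is sketched below. It assumes the data can be split into fixed-size chunks and that each chunk's results fit in a JSON file. `process_in_chunks`, the `process` callback, and the `chunk_NNNNN.json` naming are all hypothetical and not taken from the post-processor itself.

```python
import json
from pathlib import Path


def process_in_chunks(records, process, out_dir, chunk_size=100):
    """Process records chunk by chunk, writing each finished chunk to its own file.

    Chunks whose output file already exists are skipped, so a rerun after a
    break only redoes the unfinished work. Returns a list of
    (chunk_index, error) pairs for chunks that raised an exception.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for start in range(0, len(records), chunk_size):
        index = start // chunk_size
        path = out / f"chunk_{index:05d}.json"
        if path.exists():
            continue  # already processed in an earlier run
        try:
            results = [process(r) for r in records[start:start + chunk_size]]
        except Exception as exc:
            failed.append((index, repr(exc)))
            continue
        # Write to a temporary file first so a crash never leaves a partial chunk.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(results))
        tmp.replace(path)
    return failed
```

Calling the function a second time with the same `out_dir` retries only the chunks that failed or were never reached.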
Discussion of output format
- The one-to-one linkages as proposed on the Refactor-titled sheets may require a lot of programming changes
- Issues to address
  - There is no entry for a Twitter account (as opposed to a tweet), so hits on a Twitter user mention are replicated across many tweets
  - The current discovery of articles through the tweet crawl does not retain the article-specific URL, so these article mentions are lost or generalized to the domain
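
As a possible direction for the second issue, the sketch below keeps the full article URL next to the domain derived from it, so a mention is not reduced to the domain alone. The `ArticleMention` record and `mention_from_link` helper are hypothetical and do not reflect the current output format.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ArticleMention:
    tweet_id: str
    article_url: str  # full URL as found in the tweet, kept verbatim
    domain: str       # derived from article_url, used for domain-level matching


def mention_from_link(tweet_id, url):
    """Build a mention record that keeps both the article URL and its domain."""
    domain = urlparse(url).netloc.lower()
    return ArticleMention(tweet_id=tweet_id, article_url=url, domain=domain)
```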