February 25, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Meeting notes
- Crawler performance
- Some sites that return only a single hit in a full crawl return more hits when crawled alone (no other domains in the queue)
- Raiyan suggests exploring batching
- The single hits may occur because the queue prioritizes certain domains instead of distributing work equally across all of our domains
- To be explored: how to create batches, how to automatically detect when a batch has finished crawling, and when to start crawling the next batch
- Jacqueline retrieved the database
- One site being debugged has deliberately hidden popups (about half the time, the button to close the popup cannot be found)
- Queue considerations
- Using Puppeteer without Apify would give us more control over queueing
- Create post-processor framework
- Catch errors in post-processing and write data that has already been processed to multiple files so that processing can be picked up after a break
- Picking back up after a break is still in progress, but the files are already being saved successfully
- Discussion of output format
- The one-to-one linkages proposed on the Refactor-titled sheets may require substantial programming changes
- Issues to address
- No entry for a Twitter account (as opposed to a tweet) means that Twitter user mention hits are replicated across many tweets
- The current discovery of articles through the tweet crawl does not retain the article-specific URL, so these article mentions are lost or generalized to the domain
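
The batching idea raised under crawler performance could be sketched roughly as follows. This is only an illustration in Python, not the crawler's actual queue: the function names `make_batches` and `round_robin`, the batch size, and the example domains are all hypothetical. It splits the domain list into fixed-size batches and, within a batch, interleaves URLs so that every domain gets one turn per pass instead of one domain dominating the queue.

```python
from collections import deque
from itertools import islice
from typing import Iterable, Iterator


def make_batches(domains: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most batch_size domains (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(domains)
    # islice takes the next batch_size items; an empty list ends the loop.
    while batch := list(islice(it, batch_size)):
        yield batch


def round_robin(urls_by_domain: dict[str, list[str]]) -> Iterator[str]:
    """Interleave URLs so each domain in the batch gets one turn per pass."""
    queues = deque(
        (domain, deque(urls)) for domain, urls in urls_by_domain.items() if urls
    )
    while queues:
        domain, urls = queues.popleft()
        yield urls.popleft()
        if urls:
            # The domain still has work left, so it goes to the back of the line.
            queues.append((domain, urls))


if __name__ == "__main__":
    # Placeholder domains, not the project's real scope.
    for batch in make_batches(["a.example", "b.example", "c.example"], 2):
        print(batch)
```

A batch counts as finished here once `round_robin` is exhausted; deciding that for a live crawl, where new links keep being discovered, is the open question noted above.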
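
The post-processor behaviour described above (catch errors, write processed data to multiple files, pick back up after a break) might look something like this minimal sketch. It is an assumption-laden illustration, not the framework being built: `process_in_chunks`, the `chunk_*.json` file naming, and the JSON layout are all made up for the example, and it assumes the input records arrive in the same order on every run.

```python
import json
from pathlib import Path
from typing import Any, Callable, Sequence


def process_in_chunks(
    records: Sequence[Any],
    transform: Callable[[Any], Any],
    out_dir: str,
    chunk_size: int = 100,
) -> list[Path]:
    """Process records chunk by chunk, writing one output file per chunk.

    Chunks whose output file already exists are skipped, so rerunning after
    a break resumes where the previous run stopped (hypothetical scheme).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), chunk_size):
        out_path = out / f"chunk_{start:08d}.json"
        if out_path.exists():
            continue  # finished in an earlier run
        done, failed = [], []
        for record in records[start:start + chunk_size]:
            try:
                done.append(transform(record))
            except Exception as exc:
                # Keep the failure alongside the results instead of aborting.
                failed.append({"record": record, "error": repr(exc)})
        # Write to a temporary name first so a crash never leaves a
        # half-written file that would be mistaken for a finished chunk.
        tmp_path = out_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps({"done": done, "failed": failed}))
        tmp_path.replace(out_path)
        written.append(out_path)
    return written
```

Calling the function a second time with the same arguments returns an empty list, because every chunk file is already on disk.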