July 22, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

with NYT handles (many more than 320), the crawler went through everything without stopping
15 Twitter handles couldn't be crawled; they either no longer exist or are private

#Twitter crawler documentation

both crawlers read from the same queue at first, which led to a race condition: a URL isn't removed from the queue while a crawler is working on it, so a second crawler can pick up the same URL; the first crawler removed the URL from the queue when it finished, then the second crawler couldn't find it when it finished crawling and raised an error
possibility of splitting up the queue: won't work for the same domain (e.g. NYT), only for distinct domains; still need to account for the case that both
currently, each instance interacts with the queue three times: (1) get a URL, (2) if there is an error, check back, (3) remove the URL from the queue
Raiyan will set up the politics subdomain of NYT
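
One possible way to avoid the shared-queue race described above is to make "get a URL" an atomic claim, so a URL leaves the pending list the moment a crawler takes it. The sketch below is only an illustration under assumptions: `ClaimQueue`, `claim`, `complete`, `release`, and `crawl_all` are hypothetical names, the queue is a simple in-memory structure guarded by a lock, and the "crawl" is a placeholder. It is not the crawler's actual queue implementation.

```python
import threading
from collections import deque


class ClaimQueue:
    """Hypothetical in-memory queue (an assumption, not the real crawler queue)."""

    def __init__(self, urls):
        self._pending = deque(urls)
        self._in_progress = set()
        self._lock = threading.Lock()

    def claim(self):
        """Atomically take the next URL so no other crawler can pick it up."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.popleft()
            self._in_progress.add(url)
            return url

    def complete(self, url):
        """Mark a claimed URL as done; calling it twice is harmless."""
        with self._lock:
            self._in_progress.discard(url)

    def release(self, url):
        """Return a claimed URL to the queue after an error so it can be retried."""
        with self._lock:
            if url in self._in_progress:
                self._in_progress.remove(url)
                self._pending.append(url)


def crawl_all(queue, crawled, name):
    """Placeholder crawler loop: claim, 'crawl', then complete."""
    while (url := queue.claim()) is not None:
        crawled.append((name, url))  # stands in for the real crawl step
        queue.complete(url)


# Two crawler instances sharing one queue (example.com URLs are placeholders).
urls = [f"https://example.com/page{i}" for i in range(100)]
queue = ClaimQueue(urls)
crawled = []
workers = [
    threading.Thread(target=crawl_all, args=(queue, crawled, f"crawler-{i}"))
    for i in range(2)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With this shape, each instance still touches the queue at most three times per URL (claim, release on error, complete), but two instances can never hold the same URL at once, and removal at the end cannot fail because another instance removed it first. Whether this carries over depends on how the real queue is stored and whether it offers an atomic claim operation.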