July 22, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda
- Twitter crawl list update & update on Twitter crawl results/documentation
- Update on 1+ crawl on 1 instance
- Update on NYT crawl
- Apify update
Twitter crawl list update
- with the NYT handles included, the list has many more than 320 handles; the crawler went through all of them without stopping
- 15 Twitter handles couldn't be crawled because the accounts either no longer exist or are private
Twitter crawler documentation
- updated
Apify update
- didn't get a chance
- will look at it this week
General crawler
- an error while reading a JSON file caused the crawler to stop
- a bracket was missing in the JSON file, so the file was incomplete and the crawler just stalled
- the automatic restart script will probably handle this (the script is currently set to 5 hours)
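One way to keep an incomplete JSON file from stalling a crawl is to treat a parse failure as a skippable error. The following is a minimal sketch in Python, not the crawler's actual code; the function name `load_json_or_none` and the choice to log and skip are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_json_or_none(path: Path) -> Optional[Any]:
    """Parse a JSON file, returning None instead of raising if it is malformed.

    Hypothetical helper: the real crawler may prefer to retry or quarantine
    the file rather than skip it.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as err:
        # A missing closing bracket ends up here; report where parsing failed.
        print(f"skipping {path}: invalid JSON at line {err.lineno}, column {err.colno}")
        return None
```

A caller would check for `None` and move on to the next file, so one truncated file does not block the rest of the run until the restart script fires.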
Update on 1+ crawl on 1 instance
- both crawlers read from the same queue at first, but this caused a race condition: a URL is not removed from the queue while a crawler is working on it, so both crawlers could pick up the same URL; once the first crawler finished, it removed the URL from the queue, and when the second crawler finished it could not find the URL in the queue and raised an error
- possibility of splitting up the queue: won't work for crawls of the same domain (e.g. NYT), only for distinct domains; still need to account for the case that both
- currently, an instance interacts with the queue three times: (1) get a URL, (2) if there is an error, check back, (3) remove the URL from the queue
- Raiyan will set up the politics subdomain of NYT
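The race condition described above could be avoided by having a crawler claim a URL atomically, so that a URL being worked on is no longer visible to the other crawler. The sketch below is one possible approach in Python and not the project's implementation; the class `ClaimQueue` and its methods `claim`, `release`, and `complete` are hypothetical names that mirror the three queue interactions (get a URL, check back on error, remove from queue). It uses a `threading.Lock`, which only protects crawlers running as threads in one process; separate processes on the same instance would need a file lock or a database transaction instead.

```python
import threading
from collections import deque
from typing import Deque, Iterable, Optional, Set


class ClaimQueue:
    """In-memory URL queue where each URL can be held by only one crawler."""

    def __init__(self, urls: Iterable[str]) -> None:
        self._pending: Deque[str] = deque(urls)
        self._in_progress: Set[str] = set()
        self._lock = threading.Lock()

    def claim(self) -> Optional[str]:
        """(1) Get a URL: move it from pending to in-progress in one step."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.popleft()
            self._in_progress.add(url)
            return url

    def release(self, url: str) -> None:
        """(2) On error: put a claimed URL back so it can be retried."""
        with self._lock:
            if url in self._in_progress:
                self._in_progress.remove(url)
                self._pending.append(url)

    def complete(self, url: str) -> None:
        """(3) Remove from queue: drop the URL once crawling has finished."""
        with self._lock:
            # discard() does not raise if the URL is already gone.
            self._in_progress.discard(url)


queue = ClaimQueue(["https://example.com/a", "https://example.com/b"])
first = queue.claim()   # crawler 1
second = queue.claim()  # crawler 2 receives a different URL
queue.complete(first)
queue.complete(second)
```

Because the URL leaves the pending list at claim time, the second crawler never works on a URL that the first crawler will later delete.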
Order:
- set up politics subdomain under different instance
- update Apify
- look at postprocessor
- any ideas for speeding up: write down for next devs