July 22, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Agenda

  • Twitter crawl list update & update on twitter crawl result/documentation
  • Update on 1+ crawl on 1 instance
  • Update on NYT crawl
  • Update on Apify update

Twitter crawl list update

  • with the NYT handles included, the list has far more than 320 handles; the crawler went through all of them without stopping
  • 15 Twitter handles couldn't be crawled because they either no longer exist or are private

Twitter crawler documentation

  • updated

Apify update

  • didn't get a chance
  • will look at it this week

General crawler

  • an error while reading a JSON file caused the crawler to stop
  • the JSON file was missing a bracket, so it was incomplete and the crawler just stalled
  • probable remedy: the script that restarts the crawler automatically (currently set to 5 hours)
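
The stall on an incomplete JSON file could be avoided by treating a parse failure as a skippable event. The following is a minimal sketch in Python, not the crawler's actual code; the function name `load_json_or_none` is hypothetical, and it assumes each JSON file can be skipped independently when it is malformed.

```python
import json
from pathlib import Path
from typing import Any, Optional


def load_json_or_none(path: Path) -> Optional[Any]:
    """Return the parsed JSON content, or None if the file is unreadable or malformed.

    Hypothetical helper: the name and the skip-on-error policy are assumptions.
    """
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as err:
        # json.JSONDecodeError is a subclass of ValueError, so a missing
        # bracket lands here instead of stopping the whole crawl.
        print(f"skipping {path}: {err}")
        return None
```

A caller would check for `None` and move on to the next file, leaving the periodic restart script as a fallback only.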

Update on 1+ crawl on 1 instance

  • both crawlers read from the same queue at first, but this led to a race condition: while a crawler is working on a URL, the URL isn't removed from the queue, so the second crawler can pick up the same URL; once the first crawler finished, it removed the URL from the queue, and when the second crawler finished crawling it couldn't find the URL and raised an error
  • possibility of splitting up the queue: won't work for the same domain (e.g. NYT), only for distinct domains; still need to account for the case that both
  • currently, an instance interacts with the queue three times: (1) get a URL, (2) if there is an error, check back, (3) remove the URL from the queue
  • Raiyan will set up the politics subdomain of NYT
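
One way to avoid the race described above is to have a crawler claim a URL atomically when it fetches it, so that no second crawler can pick up the same URL. The sketch below is an illustration in Python, not the crawler's implementation; the class `ClaimQueue` and its methods are hypothetical, and it assumes an in-memory queue shared by threads. Crawlers running as separate processes would need the same claim step in whatever shared store holds the queue.

```python
import threading
from collections import deque
from typing import Deque, Iterable, Optional, Set


class ClaimQueue:
    """Hypothetical shared queue in which a URL is claimed before it is crawled."""

    def __init__(self, urls: Iterable[str]) -> None:
        self._pending: Deque[str] = deque(urls)
        self._in_progress: Set[str] = set()
        self._lock = threading.Lock()

    def claim(self) -> Optional[str]:
        """(1) Get a URL and mark it as in progress in a single locked step."""
        with self._lock:
            if not self._pending:
                return None
            url = self._pending.popleft()
            self._in_progress.add(url)
            return url

    def release(self, url: str) -> None:
        """(2) On error, put the URL back so it can be retried."""
        with self._lock:
            if url in self._in_progress:
                self._in_progress.remove(url)
                self._pending.append(url)

    def complete(self, url: str) -> bool:
        """(3) Remove a finished URL; returns False instead of failing if it is already gone."""
        with self._lock:
            if url in self._in_progress:
                self._in_progress.remove(url)
                return True
            return False
```

Because `claim` removes the URL from the pending list immediately, two crawlers never work on the same URL, and `complete` reports a missing URL through its return value so a late removal does not crash the crawler.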

Order:

  1. set up politics subdomain under different instance
  2. update Apify
  3. look at postprocessor
  4. any ideas for speeding up: write down for next devs