June 23, 2021

Agenda:

  • Domain Crawler:
  1. memory issues from last week (what is the size of the JSON output folders after 24 hours?)
  2. new implementations: adding timestamps to errors?
  3. testing the crawler on a larger instance
  4. discrepancy between the links crawled and the JSON files created
  • Twitter Crawl:
  1. status of the NYT journalists crawl, and what has been discovered in terms of numbers

Domain Crawler:

  • the meeting went well: Nat's advice and the new error timestamps worked

  • space on the Graham server is equally divided; all folders (including those we cannot see) need to be checked before setting up an instance
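
Since folder sizes keep coming up, one quick way to total a JSON output folder before picking an instance size is a small Node/TypeScript sketch like the one below. The `apify_storage/datasets/default` path is an assumption about where the crawler writes its JSONs; adjust as needed.

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';

// Recursively sum the size of every file under `dir`
async function folderSizeBytes(dir: string): Promise<number> {
    let total = 0;
    for (const entry of await fs.readdir(dir, { withFileTypes: true })) {
        const full = path.join(dir, entry.name);
        total += entry.isDirectory()
            ? await folderSizeBytes(full)
            : (await fs.stat(full)).size;
    }
    return total;
}

// Assumed output location; point this at wherever the JSONs actually land
folderSizeBytes('./apify_storage/datasets/default')
    .then((bytes) => console.log(`${(bytes / (1024 * 1024)).toFixed(1)} MB`));
```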

** ran a crawler on a large instance that Jacqueline had created

** went through the debug output: at the 24-hour mark, the crawler starts to have trouble opening new links, and the new-page time-out error becomes more and more frequent

** roughly 12,000 links in a 24-hour period (about 8 links per minute on average); at about the 30-hour mark, the rate went to 15 links per minute without the time-out error

** then it also hit the range error

** when the crawler hits the time-out error, the link is put back in the queue; this can become a recursive loop that eventually ends in the range error
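
A minimal sketch of how that loop could be capped, assuming the crawler uses the Apify SDK's `PuppeteerCrawler` (option names per SDK 1.x; the handler bodies here are placeholders): `maxRequestRetries` bounds how many times a timed-out link is re-queued, and `handleFailedRequestFunction` runs once retries are exhausted, so a slow link cannot recurse until the range error.

```typescript
import Apify from 'apify';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://example.com' }); // placeholder seed

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        navigationTimeoutSecs: 60, // budget for opening a page; exceeding it raises the time-out error
        maxRequestRetries: 2,      // cap re-queues so one link cannot loop until the range error
        handlePageFunction: async ({ request, page }) => {
            // ... extract the page and write its JSON ...
        },
        // called only after all retries are exhausted, so the crawl keeps moving
        handleFailedRequestFunction: async ({ request }) => {
            console.log(`Gave up on ${request.url}: ${request.errorMessages.join('; ')}`);
        },
    });

    await crawler.run();
});
```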

** apparently many users are getting this error because of a stealth-mode parameter; Raiyan will try to read up on and finesse the parameters. One question: can the time-out errors be put in a separate list? One possibility is to re-start the crawler with a separate script (see the sketch below)
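
A sketch of both ideas under the same Apify SDK assumption (in SDK 1.x the stealth toggle lives on `launchContext`, though older versions put it on `launchPuppeteerOptions`; the `timeouts` dataset name is made up here): turn on stealth mode, and push timed-out links to a separate named dataset that a second script could later feed back into a fresh crawl.

```typescript
import Apify from 'apify';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    // separate list for timed-out links, kept apart from the regular output
    const timeouts = await Apify.openDataset('timeouts');

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        launchContext: { stealth: true }, // the stealth-mode parameter mentioned above
        handlePageFunction: async ({ request, page }) => {
            // ... normal extraction ...
        },
        handleFailedRequestFunction: async ({ request }) => {
            // record the link so a separate script can re-crawl it later
            await timeouts.pushData({ url: request.url, errors: request.errorMessages });
        },
    });

    await crawler.run();
});
```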

** all the JSON files are now being created -- yay!

** typical size of a JSON after 24 hours: about 0.03 MB per file, so the ~12,000 files from a 24-hour run come to roughly 360 MB

** Apify has a storage folder for the queue; if it is set up for two instances, there would probably not be any problem with duplication (the queue de-duplicates, as sketched below)
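
For reference, the de-duplication would come from the request queue itself: the Apify SDK's `RequestQueue` keys requests by `uniqueKey` (the normalized URL by default), so the same URL added twice is only processed once. A small sketch:

```typescript
import Apify from 'apify';

Apify.main(async () => {
    const queue = await Apify.openRequestQueue();
    // Both adds point at the same URL; the queue keys requests by
    // uniqueKey (the normalized URL by default), so the second add is a no-op.
    const first = await queue.addRequest({ url: 'https://example.com/page' });
    const second = await queue.addRequest({ url: 'https://example.com/page' });
    console.log(first.wasAlreadyPresent);  // false
    console.log(second.wasAlreadyPresent); // true
});
```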