June 23, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
Agenda:
- Domain Crawler:
  - memory issues from last week (what is the size of the JSON folders after 24 hours?)
  - new implementations: adding time stamps to errors?
  - testing the crawler on a larger instance
  - discrepancy between the number of links crawled and the number of JSON files created
- Twitter Crawl:
  - status of the NYT journalists crawl and what has been discovered in terms of numbers
Domain Crawler:
- The meeting went well; Nat's advice and the time stamps worked.
- Space on the Graham server is divided equally, so all folders (including those we can't see) need to be checked before setting up an instance.
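Adding time stamps to errors (one of the agenda items) can be as simple as prefixing each log entry with an ISO-8601 string, which makes it possible to see how error frequency changes over a long crawl. A minimal hypothetical sketch, not the project's actual logging code:

```javascript
// Hypothetical error logger: prefix every entry with an ISO-8601 timestamp
// so error frequency over a long crawl can be analyzed afterwards.
function logError(message, sink = console.error) {
  const entry = `[${new Date().toISOString()}] ${message}`;
  sink(entry);
  return entry; // returned so callers/tests can inspect the entry
}

logError('new page time-out');
```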
** Ran the crawler on a large instance that Jacqueline had created.
** During debugging: at the 24 hour mark, the crawler starts to have trouble opening new links.
** Error: the new-page time-out error becomes more and more frequent.
** 12,000 links were crawled in a 24 hour period.
** At about the 30 hour mark, it dropped to 15 links per minute without the time-out error.
** Then the range error also appeared.
** When the crawler hits the time-out error, the link is put back in the queue; this can become a recursive loop that eventually triggers the range error.
** Apparently many users are getting this error because of a stealth-mode parameter; Raiyan will read up on it and finesse the parameters. One question: can the time-out errors be put in a separate list? One possibility is to re-start the crawler with a separate script.
** all the JSON files are now being created -- yay!
** Typical size of a JSON file after 24 hours: about 0.03 MB.
** Apify keeps a storage folder for the queue; if it is set up for two instances, duplication would probably not be a problem.
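The time-out/re-queue loop discussed above could be avoided by capping retries and diverting persistently failing URLs into a separate list, which a restart script could then process. A minimal sketch of that idea; all names here are hypothetical, not the actual crawler code:

```javascript
// Hypothetical sketch: instead of re-enqueueing a timed-out URL forever,
// retry it a bounded number of times, then move it to a separate
// `timedOut` list with a timestamp for later inspection or restart.

const MAX_RETRIES = 2; // assumption: per-URL retry cap

function makeQueue() {
  return {
    pending: [],        // URLs still to crawl (or retry)
    timedOut: [],       // URLs that exhausted their retries
    retries: new Map(), // per-URL retry counter
  };
}

function handleTimeout(queue, url) {
  const n = (queue.retries.get(url) || 0) + 1;
  queue.retries.set(url, n);
  if (n > MAX_RETRIES) {
    // Record the failure with a timestamp and stop retrying,
    // so this URL can no longer cycle through the queue.
    queue.timedOut.push({ url, failedAt: new Date().toISOString() });
  } else {
    queue.pending.push(url); // bounded retry
  }
}
```

The `timedOut` list could then be written to disk and fed to a separate restart script, as suggested in the meeting.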
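The figures above also give a rough answer to last week's memory question: a back-of-the-envelope storage estimate for a 24 hour run, assuming the crawl rate and file size stay constant:

```javascript
// Back-of-the-envelope storage estimate from the figures in the notes.
const linksPerDay = 12000; // links crawled in a 24 hour period
const mbPerJson = 0.03;    // typical JSON file size in MB

const mbPerDay = linksPerDay * mbPerJson; // ≈ 360 MB of JSON per day
console.log(`≈ ${Math.round(mbPerDay)} MB of JSON per 24 hours`);
```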