May 27, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki

Testing Raiyan's new crawler on a Compute Canada (Com Can) instance to use more resources and see whether there is a speed gain

** see whether one giant instance or many instances works better

John tried to install the code locally and encountered errors

Raiyan ran the code on a Graham instance; everything was going well at first, but then errors appeared:

** 11 domains ran well, and 7 had only one page crawled

** the 11 crawled 5000 pages but produced only 3000 JSON files

** 4 main questions about why the number of pages crawled isn't the same as the number of JSON files

*** After two days of running, the crawler enters an infinite loop

*** Some domains only have a single page crawled and subsequent links are not added to the queue

*** JSON not being returned for some crawled links

*** Two specific domains did not have a results folder created

** bigger issue: after 2 days the crawler gets stuck in an infinite loop on some pages; re-running it works again

** otherwise working better than previous versions: a decent amount crawled for each domain

** what happens if it is run on the NYT now? The NYT was one of the domains that didn't work: the crawler found links but didn't queue them

** is the server environment an issue? needs debugging

** most pressing issue: question 1, the infinite loop; perhaps add more tracing and error saving

*** updated the list to show which domains are working and which are not, and also to think about which ones start the infinite loop

*** Raiyan will try several different VMs to test the crawler: 1: only working domains; 2: full scope, to record which domains are problematic; 3: only problematic domains

** use the original scope and check which domains are running; then re-run without the specific problematic domains
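
A rough sketch of how the working / problematic split above could be automated, in Python for illustration only (this is not the crawler's code). It assumes a list of crawled URLs is available from the crawler's log and that results are stored as one folder of `.json` files per domain; both the layout and the names `classify_domains` and `split_scope` are hypothetical. It flags three of the four symptoms listed above (no results folder, single page only, fewer JSON files than pages); the infinite loop would still need tracing inside the crawler itself.

```python
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse


def classify_domains(crawled_urls, results_root, scope_domains):
    """Label each scope domain with the first symptom it shows.

    crawled_urls:  URLs the crawler reports as visited (assumed to come from a log).
    results_root:  folder holding one sub-folder of .json files per domain (assumed layout).
    scope_domains: domains from the crawl scope, written exactly as they appear in URLs.
    """
    # Count crawled pages per domain; "www." variants are treated as separate domains here.
    pages = Counter(urlparse(url).netloc for url in crawled_urls)
    report = {}
    for domain in scope_domains:
        folder = Path(results_root) / domain
        has_folder = folder.is_dir()
        json_count = len(list(folder.glob("*.json"))) if has_folder else 0
        if not has_folder:
            status = "no results folder"
        elif pages[domain] <= 1:
            status = "single page only"
        elif json_count < pages[domain]:
            status = "missing JSON"
        else:
            status = "ok"
        report[domain] = {"pages": pages[domain], "json": json_count, "status": status}
    return report


def split_scope(report):
    """Return (working, problematic) domain lists for the separate test runs."""
    working = sorted(d for d, r in report.items() if r["status"] == "ok")
    problematic = sorted(d for d, r in report.items() if r["status"] != "ok")
    return working, problematic
```

The two lists returned by `split_scope` correspond to the "only working domains" and "only problematic domains" runs; the full-scope run is what would produce the crawled-URL list in the first place.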

John's errors

** followed the GitHub instructions and sample commands, and got an error that a module was missing

** Raiyan suggested using the other branch, batch crawling, as it probably has the more up-to-date installer