May 27, 2021 - UTMediaCAT/mediacat-docs GitHub Wiki
- testing Raiyan's new crawler on a Compute Canada instance to use more resources and see if there is a speed gain
** determine whether one giant instance or many smaller instances works better
- John tried to install the code locally and encountered errors
- Raiyan ran the code on a Graham instance; everything went well at first, but then errors appeared:
** 11 domains ran well, and 7 crawled only one page
** the 11 crawled 5000 pages but produced only 3000 JSON files
** 4 main questions (e.g., why the number of pages crawled doesn't match the number of JSON files):
*** After two days of running, the crawler enters an infinite loop
*** Some domains only have a single page crawled and subsequent links are not added to the queue
*** JSON not being returned for some crawled links
*** Two specific domains did not have a results folder created
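To investigate the page-count vs JSON-file discrepancy, one quick check would compare the crawler's reported page totals against the files actually on disk. A minimal sketch, assuming results are written as one JSON file per page under a per-domain folder (this layout is an assumption, not the crawler's confirmed structure):

```python
from pathlib import Path

def count_results(results_dir):
    """Count .json result files per domain folder.

    Assumes a layout of results_dir/<domain>/<page>.json; adjust the
    glob pattern if the crawler nests its output differently.
    """
    counts = {}
    for domain_dir in Path(results_dir).iterdir():
        if domain_dir.is_dir():
            counts[domain_dir.name] = sum(1 for _ in domain_dir.glob("*.json"))
    return counts
```

Comparing these counts against the crawler's own page totals would show which domains are losing files.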
** bigger issue: after 2 days the crawler is in an infinite loop on some pages; re-running it works again
** otherwise working better than previous versions: a decent number of pages crawled for each domain
** what if we run it on the NYT now? The NYT was one of the domains that didn't work: the crawler found links but didn't queue them
** is the server environment an issue? needs debugging
** most pressing issue: #1, the infinite loop; perhaps add more tracing & error saving
*** updated the list to show which domains are working and which are not, and also to consider which ones start an infinite loop
*** Raiyan will try several different VMs to test the crawler: (1) only working domains; (2) the full scope, to record which domains are problematic; (3) only problematic domains
** use the original scope and check which domains are running; then re-run without the specific problematic domains
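The "add more tracing & error saving" idea could be sketched as a wrapper around the crawl loop: log every fetch, save errors instead of dying, and keep a seen-set so re-queued links cannot loop forever. This is a hedged illustration, not the actual crawler's code; `fetch` stands in for whatever function retrieves a page and returns its links:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")

MAX_SECONDS_PER_PAGE = 300  # assumed per-page time budget; tune as needed

def crawl_with_tracing(seeds, fetch):
    """Breadth-first crawl with tracing.

    `fetch` is a hypothetical callable returning the links found on a
    page. The seen-set prevents already-queued links from being
    re-added, which is one common source of infinite loops.
    """
    queue, seen, visited = list(seeds), set(seeds), []
    while queue:
        url = queue.pop(0)
        start = time.monotonic()
        log.info("fetching %s", url)
        try:
            links = fetch(url)
        except Exception:
            log.exception("error fetching %s", url)  # save the error, keep going
            continue
        if time.monotonic() - start > MAX_SECONDS_PER_PAGE:
            log.warning("%s is slow, possible stall", url)
        visited.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

With the per-URL log lines, a stalled run would show exactly which page the crawler was fetching when it stopped making progress.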
- John's errors
** followed the GitHub instructions and sample commands, but got an error that a module was missing
** Raiyan suggested using the other branch, batch crawling, as it probably has the more up-to-date installer
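For the missing-module error, a quick local check could list exactly which imports fail before trying the other branch's installer. The module names below are placeholders, not the crawler's confirmed dependency list:

```python
import importlib.util

def missing_modules(required):
    """Return the subset of module names that cannot be imported.

    `required` is a hypothetical dependency list; replace it with the
    modules the crawler actually imports (or the requirements file of
    the batch-crawling branch).
    """
    return [m for m in required if importlib.util.find_spec(m) is None]
```

Running this against the real dependency list would pinpoint which package the installer skipped.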