2.3.1.1 Crawler
The crawler scripts
- full_scan.py - The script responsible for the full scans; its onion seed set is maintained both by itself and by the onion_finder.py script
- high_freq.py - The script responsible for the high-frequency scans; its onion seed set is 1000 nodes chosen at random from the latest full scan
- onion_finder.py - The script responsible for finding more onions by crawling the regular web and growing the onion seed set; its own seed set is a few links, chosen a priori, that publicly publish onion links (a sketch of this discovery step follows the list)
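The wiki does not show the discovery logic itself, so the following is only a minimal sketch of how extraction of .onion links from regular-web pages could work, assuming pages are fetched with requests and onion hosts are pulled out with a regular expression; find_onions and the seed URL are hypothetical placeholders, not the actual onion_finder.py code:

```python
import re
import requests

# v3 onion addresses are 56 base32 characters; the optional group also matches
# the older 16-character v2 form that may still appear on public link lists.
ONION_RE = re.compile(r"\b[a-z2-7]{16}(?:[a-z2-7]{40})?\.onion\b", re.IGNORECASE)

def find_onions(url, timeout=10):
    """Fetch a regular-web page and return the set of .onion hosts it mentions."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return set()
    return {match.lower() for match in ONION_RE.findall(resp.text)}

if __name__ == "__main__":
    # A priori seed pages that publicly list onion links (placeholder URL).
    seeds = ["https://example.com/onion-directory"]
    discovered = set()
    for seed in seeds:
        discovered |= find_onions(seed)
    print(f"discovered {len(discovered)} candidate onions")
```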
The core of these three scripts is basically the same: the crawling flow is implemented with two kinds of queues and two kinds of threads (a code sketch follows the list):
- URL Queue - a priority queue that is first initialised with the original seed set at priority 1; every newly discovered onion that is not already in the set is pushed into the queue with priority 2, and a None object inserted with priority 3 signals to the crawler threads using this queue that all the tasks are done
- Vertices Queue - a priority queue that contains the onion objects ready to be inserted into the graph database; the objects in this queue have the four properties that the database schema requires
- Crawler Thread - 5 threads that pop from the URL queue and push to both queues; these threads are responsible for crawling the given links, discovering more onions, creating the onion objects for the database, and expanding the seed set
- DB Thread - the thread that handles all the insertions into the graph database
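A minimal sketch of this two-queue, two-thread-type flow is shown below. It assumes that fetch_onion and insert_vertex are hypothetical stand-ins for the real crawling and database code, and it simplifies the vertices queue to a plain FIFO:

```python
import itertools
import queue
import threading

NUM_CRAWLER_THREADS = 5   # configurable, like the other numbers mentioned in the wiki

# URL queue entries are (priority, tiebreaker, payload); the counter keeps heap
# comparisons away from the payload, so the None sentinel can share the queue
# with URL strings.
url_queue = queue.PriorityQueue()
vertices_queue = queue.Queue()   # simplified to a FIFO in this sketch
seen = set()                     # the seed set, expanded as onions are discovered
seen_lock = threading.Lock()
tiebreaker = itertools.count()

def fetch_onion(url):
    """Hypothetical stand-in for crawling one onion: returns the vertex object
    for the database and the set of onions discovered on that page."""
    return {"onion": url}, set()

def insert_vertex(vertex):
    """Hypothetical stand-in for the insertion into the graph database."""
    pass

def crawler_worker():
    while True:
        _, _, url = url_queue.get()
        if url is None:
            # Priority-3 sentinel: nothing at priority 1 or 2 is left.
            # Re-insert it so the other crawler threads shut down too.
            url_queue.put((3, next(tiebreaker), None))
            break
        vertex, discovered = fetch_onion(url)
        vertices_queue.put(vertex)                           # ready for the DB thread
        with seen_lock:
            for onion in discovered - seen:
                seen.add(onion)
                url_queue.put((2, next(tiebreaker), onion))  # new onions: priority 2

def db_worker():
    while True:
        vertex = vertices_queue.get()
        if vertex is None:               # posted after all crawler threads finish
            break
        insert_vertex(vertex)

def run(seed_set):
    for onion in seed_set:
        seen.add(onion)
        url_queue.put((1, next(tiebreaker), onion))          # original seeds: priority 1
    url_queue.put((3, next(tiebreaker), None))               # the "all done" sentinel
    crawlers = [threading.Thread(target=crawler_worker) for _ in range(NUM_CRAWLER_THREADS)]
    db_thread = threading.Thread(target=db_worker)
    for t in crawlers + [db_thread]:
        t.start()
    for t in crawlers:
        t.join()
    vertices_queue.put(None)   # no more vertices can arrive; stop the DB thread
    db_thread.join()

if __name__ == "__main__":
    run({"placeholderonionaddress.onion"})   # placeholder seed set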
The numbers mentioned above, such as the number of threads, timeouts, etc., are configurable and can be changed at any time in the scripts.