The Cache System (Crawl Data) - on9au/atar-rocks-downloader GitHub Wiki
The Cache System (also called Crawl Data) consists of files containing the web crawler's data. These files cache that data so the entire website does not have to be re-scanned.
Crawl Data is especially useful for those who just want to download the files from atar.rocks, or who lack the high-end hardware needed to run the crawler.
It saves time by ensuring files are not scanned multiple times.
Saving Crawl Data
To save Crawl Data when you run atar-rocks-downloader, use the -s flag:
atar-rocks-downloader -s
You will see a confirmation like:
202x-xx-xxTxx:xx:xx.xxxxxxx INFO atar_rocks_downloader: Saving crawl data to file: crawl_data.bin
Notice that it defaults to the file crawl_data.bin. To change that, use the -c <file name or directory> flag in conjunction with -s:
atar-rocks-downloader -s -c "hello_world.bin"
This will result in the confirmation:
202x-xx-xxTxx:xx:xx.xxxxxxx INFO atar_rocks_downloader: Saving crawl data to file: `hello_world.bin`
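Since -c accepts either a file name or a directory, the path has to be resolved somehow before saving. The following is a minimal sketch of one way that could work, not the tool's actual code: the helper name resolve_crawl_data_path is hypothetical, and the rule that a directory means "use the default file name inside it" is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Default file name, as shown in the confirmation message.
pub const DEFAULT_FILE: &str = "crawl_data.bin";

/// Hypothetical helper: turns the optional value of `-c` into a file path.
pub fn resolve_crawl_data_path(arg: Option<&Path>) -> PathBuf {
    match arg {
        // No `-c` given: fall back to the default file in the current directory.
        None => PathBuf::from(DEFAULT_FILE),
        // Assumption: an existing directory means "default file name inside it".
        Some(p) if p.is_dir() => p.join(DEFAULT_FILE),
        // Anything else is treated as the file name itself.
        Some(p) => p.to_path_buf(),
    }
}
```

Under these assumptions, passing "hello_world.bin" yields that file name unchanged, while passing an existing directory yields crawl_data.bin inside that directory.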
Loading Crawl Data
To load Crawl Data when you run atar-rocks-downloader, use the -l flag:
atar-rocks-downloader -l
This will look for the default file name crawl_data.bin. If the file is named differently, also use the -c <file name or directory> flag in conjunction with -l:
atar-rocks-downloader -l -c "hello_world.bin"
When a Crawl Data file is loaded, the tool skips the scanning phase entirely and goes straight to prompting you to download.
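The load-or-scan decision can be pictured roughly as follows. This is an illustrative sketch only: the names CrawlData, Source, and obtain_crawl_data are hypothetical, and the cache is read here as newline-separated text purely to keep the example dependency-free (the real file is a bincode binary).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for the crawler's result: a list of file URLs.
#[derive(Debug, PartialEq)]
pub struct CrawlData {
    pub files: Vec<String>,
}

/// Records whether the data came from the cache or from a fresh scan.
#[derive(Debug, PartialEq)]
pub enum Source {
    Cache,
    FreshScan,
}

/// If `load` is set, read the cache file and never call `scan`;
/// otherwise run the (expensive) scan closure.
pub fn obtain_crawl_data(
    load: bool,
    cache: &Path,
    scan: impl FnOnce() -> CrawlData,
) -> io::Result<(CrawlData, Source)> {
    if load {
        // Placeholder format: one URL per line (NOT the real bincode format).
        let text = fs::read_to_string(cache)?;
        let files = text.lines().map(String::from).collect();
        Ok((CrawlData { files }, Source::Cache))
    } else {
        Ok((scan(), Source::FreshScan))
    }
}
```

Because the scan is passed in as a closure, it is simply never invoked on the load path, which mirrors how the scanning phase is skipped.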
Technical stuff
File Format
The file is a binary file, serialized and deserialized with the bincode crate.
bincode was chosen over other formats like JSON for its speed and efficiency.
A Crawl Data file typically holds around 10,000 entries, so JSON would not be a very space-efficient format. It would also likely be slower to serialize and deserialize.
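To illustrate the space argument, here is a rough, standard-library-only sketch that encodes the same entries in a compact length-prefixed binary layout and in a JSON-like text form. Everything in it is an assumption for illustration: the Entry struct and its fields are hypothetical, the URLs are made up, and the hand-rolled binary layout is not claimed to match bincode's actual wire format.

```rust
/// Hypothetical crawl entry; the real struct in atar-rocks-downloader may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub url: String,
    pub size_bytes: u64,
}

/// Compact binary layout: entry count, then per entry a u64 length prefix,
/// the UTF-8 bytes of the URL, and a u64 size (all little-endian).
pub fn encode_binary(entries: &[Entry]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u64).to_le_bytes());
    for e in entries {
        out.extend_from_slice(&(e.url.len() as u64).to_le_bytes());
        out.extend_from_slice(e.url.as_bytes());
        out.extend_from_slice(&e.size_bytes.to_le_bytes());
    }
    out
}

/// Reads a little-endian u64 at `*pos` and advances the cursor.
fn read_u64(bytes: &[u8], pos: &mut usize) -> Option<u64> {
    let end = pos.checked_add(8)?;
    let mut chunk = [0u8; 8];
    chunk.copy_from_slice(bytes.get(*pos..end)?);
    *pos = end;
    Some(u64::from_le_bytes(chunk))
}

/// Inverse of `encode_binary`; returns None on truncated or invalid input.
pub fn decode_binary(bytes: &[u8]) -> Option<Vec<Entry>> {
    let mut pos = 0;
    let count = read_u64(bytes, &mut pos)?;
    let mut entries = Vec::new();
    for _ in 0..count {
        let len = read_u64(bytes, &mut pos)? as usize;
        let end = pos.checked_add(len)?;
        let url = String::from_utf8(bytes.get(pos..end)?.to_vec()).ok()?;
        pos = end;
        let size_bytes = read_u64(bytes, &mut pos)?;
        entries.push(Entry { url, size_bytes });
    }
    Some(entries)
}

/// JSON-like text form. Field names are repeated for every entry.
/// Assumes URLs contain no characters that need JSON escaping.
pub fn encode_json_like(entries: &[Entry]) -> String {
    let items: Vec<String> = entries
        .iter()
        .map(|e| format!("{{\"url\":\"{}\",\"size_bytes\":{}}}", e.url, e.size_bytes))
        .collect();
    format!("[{}]", items.join(","))
}

/// Generates `n` made-up entries for the comparison.
pub fn sample_entries(n: usize) -> Vec<Entry> {
    (0..n)
        .map(|i| Entry {
            url: format!("https://example.com/file_{}.pdf", i),
            size_bytes: 1_000_000 + i as u64,
        })
        .collect()
}
```

Comparing encode_binary(&sample_entries(10_000)).len() with encode_json_like(&sample_entries(10_000)).len() shows the binary form coming out smaller, mainly because the text form repeats the field names and punctuation for every entry. The gap would grow with more fields per entry.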