The Cache System (Crawl Data) - on9au/atar-rocks-downloader GitHub Wiki

The Cache System (also called Crawl Data) consists of files that store the web crawler's results. Caching these results means the entire website does not have to be re-scanned on later runs.

Crawl Data is especially useful for those who only want to download the files from atar.rocks, and for those without high-end hardware to run the crawler.

It saves time because files are not scanned more than once.

Saving Crawl Data

To save Crawl Data when you run atar-rocks-downloader, use the -s flag:

atar-rocks-downloader -s

You will see a confirmation like:

202x-xx-xxTxx:xx:xx.xxxxxxx  INFO atar_rocks_downloader: Saving crawl data to file: crawl_data.bin

Note that the file name defaults to crawl_data.bin. To change it, use the -c <file name or directory> flag together with -s:

atar-rocks-downloader -s -c "hello_world.bin"

This will result in the confirmation:

202x-xx-xxTxx:xx:xx.xxxxxxx  INFO atar_rocks_downloader: Saving crawl data to file: hello_world.bin

Loading Crawl Data

To load Crawl Data when you run atar-rocks-downloader, use the -l flag:

atar-rocks-downloader -l

This looks for the default file name, crawl_data.bin. If the file has a different name, also pass the -c <file name or directory> flag together with -l:

atar-rocks-downloader -l -c "hello_world.bin"

When a Crawl Data file is loaded, the scanning phase is skipped entirely and you are prompted to download right away.
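The control flow described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the `Entries` type, the `obtain_entries` function, and the closures standing in for "load from file" and "scan the website" are all hypothetical names chosen for this example.

```rust
// Hypothetical stand-in for whatever the crawler records per file.
type Entries = Vec<String>;

/// Returns cached entries when a loader is supplied; otherwise runs the crawl.
/// (Assumed shape only; the real tool's internals are not shown on this page.)
fn obtain_entries<L, C>(load: Option<L>, crawl: C) -> Entries
where
    L: FnOnce() -> Entries,
    C: FnOnce() -> Entries,
{
    match load {
        // Loading was requested: the crawl closure is never called.
        Some(load) => load(),
        // No cache requested: fall back to a full scan.
        None => crawl(),
    }
}

fn main() {
    // Pretend these entries came from a previously saved Crawl Data file.
    let cached = || vec!["a.pdf".to_string(), "b.pdf".to_string()];
    // The second closure would panic if it ran, which shows the scan is skipped.
    let entries = obtain_entries(Some(cached), || panic!("scan should be skipped"));
    println!("{} entries ready to download", entries.len());
}
```

Passing the two phases as closures keeps the sketch independent of any file format or network code; only the branch between "load" and "scan" is shown.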

Technical stuff

File Format

The file is a binary file, serialized and deserialized with the bincode crate.

bincode was chosen over formats such as JSON for its speed and compactness.

A crawl typically produces around 10,000 file entries. At that scale JSON would not be very space-efficient, and serialization and deserialization would likely be slower.
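To make the space argument concrete, here is a minimal sketch that compares a hand-rolled, length-prefixed binary layout against JSON text for the same data. It uses only the standard library and does not call bincode itself. The `Entry` struct, its fields, and the example.invalid URLs are assumptions made up for illustration, since the real crawl-data types are not shown on this page, and the byte layout is not claimed to match bincode's output.

```rust
// Hypothetical entry type; the real crawl-data structs may look different.
struct Entry {
    url: String,
    size_bytes: u64,
}

/// Hand-rolled binary layout: an 8-byte little-endian count, then for each
/// entry an 8-byte URL length, the URL's UTF-8 bytes, and an 8-byte size.
fn encode_binary(entries: &[Entry]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(entries.len() as u64).to_le_bytes());
    for e in entries {
        out.extend_from_slice(&(e.url.len() as u64).to_le_bytes());
        out.extend_from_slice(e.url.as_bytes());
        out.extend_from_slice(&e.size_bytes.to_le_bytes());
    }
    out
}

/// Minimal JSON text for the same data. No escaping is done, so this is
/// only valid for URLs without quotes or backslashes.
fn encode_json(entries: &[Entry]) -> String {
    let items: Vec<String> = entries
        .iter()
        .map(|e| format!("{{\"url\":\"{}\",\"size_bytes\":{}}}", e.url, e.size_bytes))
        .collect();
    format!("[{}]", items.join(","))
}

/// Builds `n` made-up entries for the comparison.
fn sample_entries(n: usize) -> Vec<Entry> {
    (0..n)
        .map(|i| Entry {
            url: format!("https://example.invalid/files/{}.pdf", i),
            size_bytes: 1_000_000 + i as u64,
        })
        .collect()
}

fn main() {
    let entries = sample_entries(10_000);
    let binary = encode_binary(&entries);
    let json = encode_json(&entries);
    println!("binary: {} bytes, json: {} bytes", binary.len(), json.len());
}
```

In this sketch the JSON output is larger mainly because every entry repeats the field names and punctuation, while the binary layout stores only lengths and values. The example says nothing about speed, which would need a real benchmark against bincode and a JSON library.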