# Hash Cache (it255ru/duplo GitHub Wiki)
In the previous chapter, File Hashing & Comparison, we learned about the powerful technique of file hashing. We saw how duplo generates a unique "fingerprint" (an MD5 hash) for each file's content, allowing it to accurately identify truly identical files, no matter their name or location. This is a brilliant way to find duplicates!
However, imagine if you scanned a huge folder with thousands of files today, and then decided to scan it again tomorrow, even though you haven't changed most of the files. duplo would go through the entire process of reading every single file and calculating every single hash all over again. That's a lot of work for files that haven't changed! It's like re-fingerprinting someone every time you see them, even if you know they haven't altered their fingers since yesterday.
This is exactly the problem the Hash Cache solves. It's duplo's "smart memory" or "cheat sheet" for file fingerprints.
Calculating hashes for many large files can take a significant amount of time and processing power. If duplo has already calculated a file's hash once, and the file hasn't changed, there's no need to do that intensive calculation again. The Hash Cache allows duplo to remember previously calculated hashes and reuse them, making subsequent scans much, much faster.
Use Case: You regularly scan a large photo library to find new duplicates. The first scan takes an hour. You want future scans to be much quicker if only a few new photos have been added or changed.
To solve this, duplo needs to:
- Store the calculated hash (fingerprint) of a file.
- Also store key information about the file (like its size and when it was last changed).
- The next time it encounters the same file, it quickly checks its "cheat sheet."
- If the file's details match the stored details, it uses the remembered hash.
- If the file has changed, or it's a new file, it calculates a fresh hash and updates its cheat sheet.
Let's see how duplo uses this smart memory.
The Hash Cache is essentially a persistent storage where duplo saves the MD5 hash (fingerprint) for files it has already processed. But just storing the hash isn't enough. It also stores crucial metadata alongside the hash:
- File Path: The unique location of the file.
- File Size: How big the file is.
- Last Modified Time: When the file was last updated on your computer.
Why is this extra information important? Because if a file's size or last modified time changes, it means the file's content might have changed. In that case, the old hash is no longer valid, and duplo must recalculate it.
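The two validity keys the cache compares, size and last-modified time, come straight from the operating system. As a quick illustration (the file here is a throwaway created just for the demo), Python's standard `os.stat()` returns both:

```python
import os
import tempfile

# Create a small throwaway file so the demo is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"hello")
    path = f.name

# os.stat() returns the file's status: st_size is the size in bytes,
# st_mtime is the last-modified time as a Unix timestamp.
stat = os.stat(path)
file_size = stat.st_size  # 5, since we wrote 5 bytes above
mtime = stat.st_mtime     # e.g. 1678886400.0

print(file_size, mtime)
os.remove(path)  # clean up the demo file
```

If either value differs from what the cache remembers, the cached hash is treated as stale.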
Think of it like this:
| File Path (Current Scan) | Stored Size | Stored MTime (Timestamp) | Stored Hash | Valid? | Action |
|---|---|---|---|---|---|
| /photos/vacation.jpg | 2.5 MB | 1678886400 | 1a2b3c4d... | Yes | Use cached hash |
| /docs/report.docx | 1.2 MB | 1678886400 | e8f9g0h1... | No | File size or mtime changed; recalculate hash |
| /new_folder/picture.png | (new file) | (new file) | (no cached hash) | No | Calculate new hash, then store it |
The cache itself is a file (named `hash_cache.pkl` by default) that duplo saves and loads.
The Hash Cache works seamlessly with the File Hashing & Comparison process. When duplo is about to calculate a hash for a file, it first consults its cache.
When you run duplo, especially on a directory you've scanned before, you'll notice messages related to the cache:
```
python main.py my_messy_folder --cache-file my_cache.pkl
```

If it's the first time you run it, or many files have changed, you might see something like:
```
[+] Группировка файлов по размеру...                     # Grouping files by size...
[+] Найдено 12 групп кандидатов в дубликаты (50 файлов)  # Found 12 duplicate-candidate groups (50 files)
[+] Вычисление хешей для поиска дубликатов...            # Computing hashes to find duplicates...
Обработано: 50/50 файлов (10.2 файл/сек)                 # Processed: 50/50 files (10.2 files/sec)
[+] Обработка завершена за 4.9 секунд                    # Processing finished in 4.9 seconds
```
But if you run it again on the same folder without many changes:
```
[+] Группировка файлов по размеру...                     # Grouping files by size...
[+] Найдено 12 групп кандидатов в дубликаты (50 файлов)  # Found 12 duplicate-candidate groups (50 files)
[+] Вычисление хешей для поиска дубликатов...            # Computing hashes to find duplicates...
Обработано: 50/50 файлов (90.5 файл/сек)                 # Processed: 50/50 files (90.5 files/sec). Much higher speed!
[+] Использовано хешей из кэша: 45                       # Hashes reused from cache: 45
[+] Обработка завершена за 0.5 секунд                    # Processing finished in 0.5 seconds. Much faster!
```
This significantly faster processing time is due to the Hash Cache remembering the work it already did!
Let's look at how this "smart memory" functions behind the scenes.
1. **Start Scan:** duplo starts scanning a directory.
2. **Load Cache:** It first tries to load any existing hash cache from a file (e.g., `hash_cache.pkl`). If no file exists, or it's corrupted, it starts with an empty cache.
3. **Find File Candidates:** The File System Scanner and the initial size grouping identify files that might be duplicates (i.e., files with the same size).
4. **Check Cache for Each Candidate:** For each potential duplicate file, duplo asks the Hash Cache: "Do you have a valid hash for this file path, given its current size and last modified time?"
   - **Cache Hit:** If the cache does have a record, and the file's current size and modification time exactly match the cached values, the cache provides the stored hash. duplo marks this hash as "used from cache."
   - **Cache Miss:** If the cache doesn't have a record for that file, or if the file's size or modification time has changed since the last scan, the cache reports "no valid hash."
5. **Calculate or Use:**
   - If the cache provided a valid hash (Cache Hit), duplo uses it directly.
   - If not (Cache Miss), duplo calls the File Hashing & Comparison component to calculate a new hash for the file.
6. **Update Cache:** After a new hash is calculated, duplo tells the Hash Cache to store the new hash along with the file's path, current size, and current modification time.
7. **Save Cache:** Once all files have been processed, duplo saves the updated cache back to the `hash_cache.pkl` file, so it's ready for the next scan.
Here's a simple diagram illustrating this process for a single file:
```mermaid
sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant HashCache as "Hash Cache"
    participant HashingLogic as "File Hashing Logic"
    participant FileSystem as "File System"

    User->>DuploApp: "Scan folder (with cache)"
    DuploApp->>HashCache: "Load existing cache"
    Note over DuploApp: Finds a candidate file (File A)
    DuploApp->>HashCache: "Get hash for File A (path, size, mtime)"
    HashCache->>HashCache: Checks if (path, size, mtime) match cached entry
    alt Cache Hit (match found)
        HashCache-->>DuploApp: Returns cached "Hash_A_Value"
        Note over DuploApp: Uses cached hash
    else Cache Miss (no match or changed file)
        HashCache-->>DuploApp: Returns "None" (no valid hash)
        DuploApp->>HashingLogic: "Calculate hash for File A"
        HashingLogic->>FileSystem: Read File A content
        FileSystem-->>HashingLogic: Returns data blocks
        HashingLogic-->>DuploApp: Returns calculated "Hash_A_Value"
        DuploApp->>HashCache: "Store File A (path, size, mtime, Hash_A_Value)"
        Note over DuploApp: New hash calculated and stored
    end
    Note over DuploApp: Continues with next file...
    DuploApp->>HashCache: "Save updated cache"
    HashCache-->>DuploApp: Cache saved
    DuploApp-->>User: Displays scan results
```
Let's look at the actual Python code from main.py that implements the Hash Cache. The core is the HashCache class.
1. The HashCache Class Initialization:
This sets up the cache and tries to load any existing data.
```python
# File: main.py
import pickle  # Used to save/load Python objects to/from files
import os

class HashCache:
    """A class for caching computed hashes."""

    def __init__(self, cache_file="hash_cache.pkl"):
        self.cache_file = cache_file    # The name of our cache file
        self.cache = self.load_cache()  # Try to load existing cache
```

- `pickle` is a Python module that lets you save (serialize) Python objects (like dictionaries) to a file and load them back later. This is perfect for our cache!
- `__init__` is called when a `HashCache` object is created. It sets the filename and then immediately calls `load_cache()` to get any previously saved data.
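To see why `pickle` fits this job, here is a tiny round-trip of a cache-shaped dictionary through a file. The file name and the entry's values are made up for the demo:

```python
import os
import pickle
import tempfile

# A dictionary shaped like duplo's cache: path -> {size, mtime, hash}.
cache = {"/photos/vacation.jpg": {"size": 2621440,
                                  "mtime": 1678886400,
                                  "hash": "1a2b3c4d"}}

path = os.path.join(tempfile.mkdtemp(), "demo_cache.pkl")

with open(path, "wb") as f:   # 'write binary', as in save_cache()
    pickle.dump(cache, f)     # serialize the dict to disk

with open(path, "rb") as f:   # 'read binary', as in load_cache()
    restored = pickle.load(f) # deserialize it back

print(restored == cache)  # True: the dict survives the round trip
```

One caveat worth knowing: `pickle.load()` can execute arbitrary code from a malicious file, so a pickle cache should only ever be loaded from a location you trust.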
2. Loading and Saving the Cache:
These methods handle reading from and writing to the .pkl file.
```python
# ... inside HashCache class ...
    def load_cache(self):
        """Loads the cache from the file."""
        if os.path.exists(self.cache_file):  # Check if the cache file exists
            try:
                with open(self.cache_file, 'rb') as f:  # Open in 'read binary' mode
                    return pickle.load(f)  # Load the dictionary from the file
            except Exception:
                return {}  # If something goes wrong (e.g., a corrupted file), start fresh
        return {}  # If the file doesn't exist, start with an empty cache

    def save_cache(self):
        """Saves the cache to the file."""
        with open(self.cache_file, 'wb') as f:  # Open in 'write binary' mode
            pickle.dump(self.cache, f)  # Save our cache dictionary to the file
```

- `load_cache()` checks whether the cache file exists. If it does, it opens it in "read binary" (`'rb'`) mode and uses `pickle.load()` to read the saved dictionary back. The `try...except` block ensures duplo doesn't crash if the cache file is somehow unreadable; it simply starts with an empty cache instead.
- `save_cache()` opens the file in "write binary" (`'wb'`) mode and uses `pickle.dump()` to write the current `self.cache` dictionary to the file.
3. Getting a Hash from the Cache:
This is where duplo checks if it already "remembers" a file's hash.
```python
# ... inside HashCache class ...
    def get(self, file_path, file_size, mtime):
        """Returns the cached hash if it is still valid."""
        if file_path in self.cache:              # Is this file path in our cache?
            cached_data = self.cache[file_path]  # Get the stored info for this file
            if (cached_data['size'] == file_size and  # Has the size changed?
                    cached_data['mtime'] == mtime):   # Has the modification time changed?
                return cached_data['hash']  # No changes: return the remembered hash
        return None  # Not in the cache, or the file changed: no valid hash
```

- The `get` method takes the `file_path`, the current `file_size`, and the current `mtime`.
- It first checks `if file_path in self.cache`.
- Then, crucially, it compares the current `file_size` and `mtime` with `cached_data['size']` and `cached_data['mtime']`. Only if all three (path, size, mtime) match does it return `cached_data['hash']`.
- Otherwise, it returns `None`, indicating that duplo needs to calculate a new hash.
4. Storing a Hash in the Cache: This method updates the cache with a newly calculated hash.
```python
# ... inside HashCache class ...
    def set(self, file_path, file_size, mtime, file_hash):
        """Stores a hash in the cache."""
        self.cache[file_path] = {  # Store (or update) this file's entry
            'size': file_size,
            'mtime': mtime,
            'hash': file_hash
        }
```

- The `set` method simply stores the provided `file_path` as a key, with a dictionary containing its `size`, `mtime`, and `hash` as the value. If the file was already in the cache, its old entry is overwritten with the new, up-to-date information.
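To make the hit/miss rules concrete, here is a small self-contained sketch of the same `get`/`set` logic as an in-memory stand-in (the class name, paths, and metadata values are invented for the demo):

```python
# MiniHashCache mirrors HashCache's get/set logic, minus the pickle file,
# purely to demonstrate cache-hit vs. cache-miss behaviour.
class MiniHashCache:
    def __init__(self):
        self.cache = {}

    def get(self, file_path, file_size, mtime):
        if file_path in self.cache:
            d = self.cache[file_path]
            if d['size'] == file_size and d['mtime'] == mtime:
                return d['hash']  # path, size, and mtime all match
        return None  # unknown file, or metadata changed

    def set(self, file_path, file_size, mtime, file_hash):
        self.cache[file_path] = {'size': file_size,
                                 'mtime': mtime,
                                 'hash': file_hash}

cache = MiniHashCache()
cache.set("/photos/vacation.jpg", 2621440, 1678886400, "1a2b3c4d")

hit = cache.get("/photos/vacation.jpg", 2621440, 1678886400)   # same metadata -> hit
miss = cache.get("/photos/vacation.jpg", 2621440, 1678999999)  # mtime changed -> miss

print(hit, miss)  # 1a2b3c4d None
```

Note that a changed mtime alone is enough to invalidate the entry, even though the size is unchanged: the content *might* have changed, so the hash must be recomputed.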
5. Integration into find_duplicates_parallel:
The find_duplicates_parallel function (which you saw in Chapter 3: File Hashing & Comparison) now uses the HashCache to speed up its work.
```python
# File: main.py (simplified snippet)
def find_duplicates_parallel(file_list, max_workers=8, cache_file=None):
    # ... (initial setup and grouping by size) ...

    # 1. Initialize the cache
    cache = HashCache(cache_file) if cache_file else None

    # ... (code to identify candidate files for hashing) ...
    for size, files_in_group in candidate_groups.items():
        for file_path, mtime in files_in_group:
            file_size = size
            file_hash = None

            # 2. Try to get the hash from the cache FIRST!
            if cache:
                file_hash = cache.get(file_path, file_size, mtime)
                if file_hash:
                    # Got a valid hash from the cache: count it and skip calculation
                    from_cache += 1  # Track how many hashes came from the cache
                    hashes_map[file_hash].append(file_path)
                    processed += 1
                    continue  # Go to the next file; no need to calculate

            # 3. Not in the cache, or the cache entry was invalid: calculate it!
            if file_hash is None:
                file_hash = get_file_hash(file_path)  # Call the hashing function

            # 4. If a hash was calculated, store it in the cache
            if file_hash is not None and cache:
                cache.set(file_path, file_size, mtime, file_hash)

            if file_hash is not None:
                hashes_map[file_hash].append(file_path)
                processed += 1
            # ... (progress reporting) ...

    # 5. Save the updated cache when done
    if cache:
        cache.save_cache()

    # ... (rest of duplicate detection logic) ...
    return duplicates
```

This snippet highlights the key changes:

- A `HashCache` object is created at the beginning of the `find_duplicates_parallel` function.
- Before calling `get_file_hash`, `cache.get()` is called. If it returns a hash, that hash is used and the loop `continue`s to the next file, skipping the actual hash calculation.
- If `cache.get()` returns `None`, then `get_file_hash` is called to calculate the hash.
- After calculation, `cache.set()` is called to store this fresh hash for future use.
- Finally, after all files are processed, `cache.save_cache()` is called to write all the updated cache information to disk.
The Hash Cache is a crucial optimization for duplo, transforming it from a powerful but potentially slow tool into an incredibly efficient one for repeated scans. By intelligently remembering file fingerprints and only recalculating hashes when a file genuinely changes, it saves significant time and resources. This "smart memory" greatly enhances the user experience, especially with large datasets.
Now that duplo can efficiently identify identical files using hashing (and its smart cache), the next step is to put all these pieces together into a coherent system that finds and presents all these duplicates. This is the job of the Duplicate Discovery Engine, which we'll explore in the next chapter: Duplicate Discovery Engine.