
Chapter 4: Hash Cache

In the previous chapter, File Hashing & Comparison, we learned about the powerful technique of file hashing. We saw how duplo generates a unique "fingerprint" (an MD5 hash) for each file's content, allowing it to accurately identify truly identical files, no matter their name or location. This is a brilliant way to find duplicates!

However, imagine scanning a huge folder with thousands of files today, then scanning it again tomorrow, even though most of the files haven't changed. duplo would go through the entire process of reading every single file and calculating every single hash all over again. That's a lot of work for files that haven't changed! It's like re-fingerprinting someone every time you see them, even if you know their fingers haven't changed since yesterday.

This is exactly the problem the Hash Cache solves. It's duplo's "smart memory" or "cheat sheet" for file fingerprints.

What Problem Does It Solve?

Calculating hashes for many large files can take a significant amount of time and processing power. If duplo has already calculated a file's hash once, and the file hasn't changed, there's no need to do that intensive calculation again. The Hash Cache allows duplo to remember previously calculated hashes and reuse them, making subsequent scans much, much faster.

Use Case: You regularly scan a large photo library to find new duplicates. The first scan takes an hour. You want future scans to be much quicker if only a few new photos have been added or changed.

To solve this, duplo needs to:

  1. Store the calculated hash (fingerprint) of a file.
  2. Store key details about the file alongside it (its size and when it was last changed).
  3. The next time it encounters the same file, quickly check this "cheat sheet."
  4. If the file's details match the stored details, reuse the remembered hash.
  5. If the file has changed, or is new, calculate a fresh hash and update the cheat sheet.
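In code terms, the whole cheat-sheet idea fits in a few lines. The following is a conceptual sketch only: a plain dictionary plays the cache, and compute_md5 is a hypothetical stand-in for the hashing function from the previous chapter. duplo's real HashCache class appears later in this chapter.

import os

def hash_with_cache(path, cache, compute_md5):
    """Conceptual sketch of the cheat-sheet logic in the list above."""
    size = os.path.getsize(path)    # the file's current size in bytes
    mtime = os.path.getmtime(path)  # when the file was last modified
    entry = cache.get(path)         # step 3: check the cheat sheet
    if entry and entry['size'] == size and entry['mtime'] == mtime:
        return entry['hash']        # step 4: details match, reuse the hash
    new_hash = compute_md5(path)    # step 5: new or changed file, fingerprint again
    cache[path] = {'size': size, 'mtime': mtime, 'hash': new_hash}
    return new_hash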

Let's see how duplo uses this smart memory.

What is the Hash Cache?

The Hash Cache is essentially persistent storage where duplo saves the MD5 hash (fingerprint) of each file it has already processed. Just storing the hash isn't enough, though: it also stores crucial metadata alongside the hash:

  • File Path: The unique location of the file.
  • File Size: How big the file is.
  • Last Modified Time: When the file was last updated on your computer.

Why is this extra information important? Because if a file's size or last modified time changes, it means the file's content might have changed. In that case, the old hash is no longer valid, and duplo must recalculate it.
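Concretely, you can picture each cache entry as a small dictionary keyed by the file path. This sketch mirrors the structure that duplo's HashCache class stores (shown later in this chapter); the values here are made up for illustration:

cache = {
    '/photos/vacation.jpg': {
        'size': 2621440,        # file size in bytes (2.5 MB)
        'mtime': 1678886400.0,  # last-modified timestamp
        'hash': '1a2b3c4d...',  # the remembered MD5 fingerprint
    },
    # ...one entry per file that has already been hashed...
}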

Think of it like this:

| File Path | Stored Size | Stored MTime (Timestamp) | Stored Hash | Valid? | Action |
|---|---|---|---|---|---|
| /photos/vacation.jpg (Current Scan) | 2.5 MB | 1678886400 | 1a2b3c4d... | Yes | Use cached hash |
| /docs/report.docx (Current Scan) | 1.2 MB | 1678886400 | e8f9g0h1... | No | File size or mtime changed; recalculate hash |
| /new_folder/picture.png (Current Scan) | (new file) | (new file) | (no cached hash) | No | Calculate new hash, then store it |

The cache itself is a file (named hash_cache.pkl by default) that duplo saves at the end of a scan and loads back at the start of the next one.

How duplo Uses the Hash Cache

The Hash Cache works seamlessly with the File Hashing & Comparison process. When duplo is about to calculate a hash for a file, it first consults its cache.

When you run duplo, especially on a directory you've scanned before, you'll notice messages related to the cache:

python main.py my_messy_folder --cache-file my_cache.pkl

If it's the first time you run it, or many files have changed, you might see something like:

[+] Группировка файлов по размеру...                     # Grouping files by size...
[+] Найдено 12 групп кандидатов в дубликаты (50 файлов)  # Found 12 duplicate-candidate groups (50 files)
[+] Вычисление хешей для поиска дубликатов...            # Computing hashes to find duplicates...
Обработано: 50/50 файлов (10.2 файл/сек)                 # Processed: 50/50 files (10.2 files/sec)
[+] Обработка завершена за 4.9 секунд                    # Processing finished in 4.9 seconds

But if you run it again on the same folder without many changes:

[+] Группировка файлов по размеру...                     # Grouping files by size...
[+] Найдено 12 групп кандидатов в дубликаты (50 файлов)  # Found 12 duplicate-candidate groups (50 files)
[+] Вычисление хешей для поиска дубликатов...            # Computing hashes to find duplicates...
Обработано: 50/50 файлов (90.5 файл/сек)                 # Processed: 50/50 files (90.5 files/sec). Notice the much higher speed!
[+] Использовано хешей из кэша: 45                       # Hashes reused from cache: 45. It tells you how many it reused!
[+] Обработка завершена за 0.5 секунд                    # Processing finished in 0.5 seconds. Much faster!

This significantly faster processing time is due to the Hash Cache remembering the work it already did!

Under the Hood: How the Hash Cache Works

Let's look at how this "smart memory" functions behind the scenes.

Step-by-Step Walkthrough

  1. Start Scan: duplo starts scanning a directory.
  2. Load Cache: It first tries to load any existing hash cache from a file (e.g., hash_cache.pkl). If no file exists or it's corrupted, it starts with an empty cache.
  3. Find File Candidates: The File System Scanner and the initial size grouping identify files that might be duplicates (i.e., files with the same size).
  4. Check Cache for Each Candidate: For each potential duplicate file:
    • duplo asks the Hash Cache: "Do you have a valid hash for this file path, given its current size and last modified time?"
    • Cache Hit: If the cache does have a record, and the file's current size and modification time exactly match the cached values, the cache provides the stored hash. duplo marks this hash as "used from cache."
    • Cache Miss: If the cache doesn't have a record for that file, or if the file's size or modification time has changed since the last scan, the cache reports "no valid hash."
  5. Calculate or Use:
    • If the cache provided a valid hash (Cache Hit), duplo uses it directly.
    • If not (Cache Miss), duplo calls the File Hashing & Comparison component to calculate a new hash for the file.
  6. Update Cache: After a new hash is calculated, duplo tells the Hash Cache to store this new hash along with the file's path, current size, and current modification time.
  7. Save Cache: Once all files have been processed, duplo saves the updated cache back to the hash_cache.pkl file, so it's ready for the next scan.

Here's a simple diagram illustrating this process for a single file:

sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant HashCache as "Hash Cache"
    participant HashingLogic as "File Hashing Logic"
    participant FileSystem as "File System"

    User->>DuploApp: "Scan folder (with cache)"
    DuploApp->>HashCache: "Load existing cache"
    Note over DuploApp: Finds a candidate file (File A)

    DuploApp->>HashCache: "Get hash for File A (path, size, mtime)"
    HashCache->>HashCache: Checks if (path, size, mtime) match cached entry
    alt Cache Hit (match found)
        HashCache-->>DuploApp: Returns cached "Hash_A_Value"
        Note over DuploApp: Uses cached hash
    else Cache Miss (no match or changed file)
        HashCache-->>DuploApp: Returns "None" (no valid hash)
        DuploApp->>HashingLogic: "Calculate hash for File A"
        HashingLogic->>FileSystem: Read File A content
        FileSystem-->>HashingLogic: Returns data blocks
        HashingLogic-->>DuploApp: Returns calculated "Hash_A_Value"
        DuploApp->>HashCache: "Store File A (path, size, mtime, Hash_A_Value)"
        Note over DuploApp: New hash calculated and stored
    end
    Note over DuploApp: Continues with next file...

    DuploApp->>HashCache: "Save updated cache"
    HashCache-->>DuploApp: Cache saved
    DuploApp-->>User: Displays scan results

The Code Behind the Hash Cache

Let's look at the actual Python code from main.py that implements the Hash Cache. The core is the HashCache class.

1. The HashCache Class Initialization: This sets up the cache and tries to load any existing data.

# File: main.py
import pickle # Used to save/load Python objects to/from files
import os

class HashCache:
    """Класс для кэширования вычисленных хешей."""
    def __init__(self, cache_file="hash_cache.pkl"):
        self.cache_file = cache_file # The name of our cache file
        self.cache = self.load_cache() # Try to load existing cache
  • pickle is a Python module that allows you to save (serialize) Python objects (like dictionaries) to a file and load them back later. This is perfect for our cache!
  • __init__ is called when a HashCache object is created. It sets the filename and then immediately calls load_cache() to get any previously saved data.

2. Loading and Saving the Cache: These methods handle reading from and writing to the .pkl file.

    # ... inside HashCache class ...
    def load_cache(self):
        """Загружает кэш из файла."""
        if os.path.exists(self.cache_file): # Check if the cache file exists
            try:
                with open(self.cache_file, 'rb') as f: # Open in 'read binary' mode
                    return pickle.load(f) # Load the dictionary from the file
            except Exception:
                # If something goes wrong (e.g., file corrupted), start fresh
                return {}
        return {} # If file doesn't exist, start with an empty cache
    
    def save_cache(self):
        """Сохраняет кэш в файл."""
        with open(self.cache_file, 'wb') as f: # Open in 'write binary' mode
            pickle.dump(self.cache, f) # Save our cache dictionary to the file
  • load_cache() checks if the cache file exists. If it does, it opens it in "read binary" ('rb') mode and uses pickle.load() to read the saved Python object (our dictionary) from it. The try...except block ensures duplo doesn't crash if the cache file is somehow unreadable, starting with an empty cache instead.
  • save_cache() opens the file in "write binary" ('wb') mode and uses pickle.dump() to write the current self.cache dictionary to the file.
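If pickle is new to you, here is a tiny self-contained round trip demonstrating the same dump/load pattern these two methods rely on (the file name and data are throwaway values for illustration):

import pickle

entries = {'/photos/vacation.jpg': {'size': 2621440,
                                    'mtime': 1678886400.0,
                                    'hash': '1a2b3c4d...'}}

# Serialize the dictionary to disk, just as save_cache() does...
with open('demo_cache.pkl', 'wb') as f:
    pickle.dump(entries, f)

# ...and read it back, just as load_cache() does.
with open('demo_cache.pkl', 'rb') as f:
    restored = pickle.load(f)

assert restored == entries  # the round trip reproduces the dictionary exactly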

3. Getting a Hash from the Cache: This is where duplo checks if it already "remembers" a file's hash.

    # ... inside HashCache class ...
    def get(self, file_path, file_size, mtime):
        """Получает хеш из кэша, если он актуален."""
        if file_path in self.cache: # Is this file path in our cache?
            cached_data = self.cache[file_path] # Get the stored info for this file
            if (cached_data['size'] == file_size and # Has the size changed?
                cached_data['mtime'] == mtime):   # Has the modification time changed?
                return cached_data['hash'] # If no changes, return the remembered hash
        return None # If not in cache or changed, return None (meaning 'not found/valid')
  • The get method takes the file_path, current file_size, and current mtime.
  • It first checks if file_path in self.cache.
  • Then, crucially, it compares the current file_size and mtime with the cached_data['size'] and cached_data['mtime']. Only if all three (path, size, mtime) match does it return the cached_data['hash'].
  • Otherwise, it returns None, indicating that duplo needs to calculate a new hash.

4. Storing a Hash in the Cache: This method updates the cache with a newly calculated hash.

    # ... inside HashCache class ...
    def set(self, file_path, file_size, mtime, file_hash):
        """Сохраняет хеш в кэш."""
        self.cache[file_path] = { # Store (or update) this file's entry
            'size': file_size,
            'mtime': mtime,
            'hash': file_hash
        }
  • The set method simply stores the provided file_path as a key, with a dictionary containing its size, mtime, and hash as the value. If the file was already in the cache, its old entry is overwritten with the new, up-to-date information.
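Putting get and set together, a single file's trip through the cache might look like the following minimal sketch (the path is hypothetical, and get_file_hash is the hashing function from the previous chapter):

import os

cache = HashCache('hash_cache.pkl')       # loads previously saved entries, if any
path = '/photos/vacation.jpg'             # hypothetical file, for illustration only
size = os.path.getsize(path)
mtime = os.path.getmtime(path)

file_hash = cache.get(path, size, mtime)  # None unless path, size AND mtime all match
if file_hash is None:
    file_hash = get_file_hash(path)       # cache miss: do the full MD5 calculation
    cache.set(path, size, mtime, file_hash)

cache.save_cache()                        # persist the updated cache for the next scan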

5. Integration into find_duplicates_parallel: The find_duplicates_parallel function (which you saw in Chapter 3: File Hashing & Comparison) now uses the HashCache to speed up its work.

# File: main.py (simplified snippet)

def find_duplicates_parallel(file_list, max_workers=8, cache_file=None):
    # ... (initial setup and grouping by size) ...

    # 1. Initialize the cache
    cache = HashCache(cache_file) if cache_file else None
    
    # ... (code to identify candidate files for hashing) ...
    
    for size, files_in_group in candidate_groups.items():
        for file_path, mtime in files_in_group:
            file_size = size
            file_hash = None
            
            # 2. Try to get hash from cache FIRST!
            if cache:
                file_hash = cache.get(file_path, file_size, mtime)
                if file_hash:
                    # If we got a valid hash from cache, increment counter and skip calculation
                    from_cache += 1 # Keep track of how many hashes came from cache
                    hashes_map[file_hash].append(file_path)
                    processed += 1
                    continue # Go to the next file, no need to calculate
            
            # 3. If not in cache, or cache entry invalid, calculate it!
            if file_hash is None:
                file_hash = get_file_hash(file_path) # Call the hashing function

                # 4. If a hash was calculated, store it in the cache
                if file_hash is not None and cache:
                    cache.set(file_path, file_size, mtime, file_hash)
                
                if file_hash is not None:
                    hashes_map[file_hash].append(file_path)
            
            processed += 1
            # ... (progress reporting) ...
    
    # 5. Save the updated cache when done
    if cache:
        cache.save_cache()
    
    # ... (rest of duplicate detection logic) ...
    return duplicates

This snippet highlights the key changes:

  1. A HashCache object is created at the beginning of the find_duplicates_parallel function.
  2. Before calling get_file_hash, cache.get() is called. If it returns a hash, that hash is used and the loop continues to the next file, skipping the actual hash calculation.
  3. If cache.get() returns None, then get_file_hash is called to calculate the hash.
  4. After calculation, cache.set() is called to store this fresh hash for future use.
  5. Finally, after all files are processed, cache.save_cache() is called to write all the updated cache information to disk.

Conclusion

The Hash Cache is a crucial optimization for duplo, transforming it from a powerful but potentially slow tool into an incredibly efficient one for repeated scans. By intelligently remembering file fingerprints and only recalculating hashes when a file genuinely changes, it saves significant time and resources. This "smart memory" greatly enhances the user experience, especially with large datasets.

Now that duplo can efficiently identify identical files using hashing (and its smart cache), the next step is to put all these pieces together into a coherent system that finds and presents all these duplicates. This is the job of the Duplicate Discovery Engine, which we'll explore in the next chapter: Duplicate Discovery Engine.


