
Chapter 3: File Hashing & Comparison

In the previous chapter, File Categorizer, we saw how duplo organizes your files into neat categories like "Images" or "Videos" based on their extensions. This gives us a great overview of our digital library. But what if you have two pictures, vacation_pic.jpg and holiday_photo_copy.jpg? They have different names, and might even be in different folders, but are they the exact same image inside? Just looking at their names, sizes, or categories isn't enough to tell.

This is where the File Hashing & Comparison system comes in. It's like having an advanced fingerprinting expert for your digital files.

What Problem Does It Solve?

duplo needs a reliable way to confirm if two files are truly identical in their content, regardless of their name, location, or even when they were last modified.

Use Case: You've scanned your my_messy_folder and found many files. You want to identify every single instance of the exact same photo or exact same document so you can safely delete the redundant copies.

To solve this, duplo needs to:

  1. Generate a unique "fingerprint" for the content of each file.
  2. Compare these fingerprints to find perfect matches.

Let's dive into how duplo achieves this with "hashing."

What is File Hashing?

Imagine you have a very detailed document. A "hash" is a special mathematical process that takes all the content of that document and boils it down into a short, unique code, much like a human fingerprint. This short code is called a hash value or checksum. duplo uses a popular hashing method called MD5.

Here's why hashing is so powerful for finding duplicates:

  • Unique Fingerprint: Because hashing is deterministic, two files with identical content always have the same hash, while even the tiniest change (like adding a single space) results in a completely different hash value. (MD5 collisions can be engineered deliberately, but they don't occur by accident among ordinary files.)
  • Fixed Size: No matter whether your file is 1 KB or 10 GB, its MD5 hash is always a 32-character hexadecimal string (e.g., d41d8cd98f00b204e9800998ecf8427e).
  • Fast Comparison: Instead of reading and comparing every byte of two potentially huge files, duplo just needs to compare their short hash codes. This is incredibly fast!

Think of it like this:

| File Content | MD5 Hash (Fingerprint) |
|--------------|------------------------|
| "Hello World!" | ed076287532e86365e841e92b396b575 |
| "Hello World! " (note the extra space) | f6f6990a4247565c82ff573397984852 |
| "Hello World!" (same as first row) | ed076287532e86365e841e92b396b575 |
| image_a.jpg (a specific photo) | 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d |
| copy_of_image_a.jpg (an exact copy) | 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d |

As you can see, identical content always produces identical hash values, while any change to the content, however small, produces a different one.
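
You can reproduce the text rows of the table above with Python's built-in hashlib module:

import hashlib

print(hashlib.md5(b"Hello World!").hexdigest())   # ed076287532e86365e841e92b396b575
print(hashlib.md5(b"Hello World! ").hexdigest())  # the extra space gives a completely different hash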

How duplo Uses Hashing & Comparison

The File Hashing & Comparison process is the core mechanism duplo uses to accurately identify duplicates.

  1. Initial Scan & Grouping by Size: After the File System Scanner creates a list of all files, duplo first groups them by their exact size. This is a smart optimization: files with different sizes can never be identical, so we don't need to waste time generating hashes for them.
  2. Hashing Candidates: duplo then focuses only on groups where there are two or more files with the same size. For each file in these "candidate" groups, it calculates the MD5 hash.
  3. Comparing Hashes: Once all hashes are calculated, duplo compares them. If multiple files (which already had the same size) also have the same MD5 hash, duplo confidently declares them to be identical duplicates.
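
As a rough sketch, step 1's size pre-filter can be written in a few lines of Python (a simplified illustration, not duplo's exact code; all_files is a placeholder for any iterable of existing file paths):

import os
from collections import defaultdict

by_size = defaultdict(list)
for path in all_files:                          # all_files: placeholder iterable of paths
    by_size[os.path.getsize(path)].append(path)

# Only groups holding two or more same-sized files need hashing
candidates = {size: paths for size, paths in by_size.items() if len(paths) > 1}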

When you run duplo, you'll see progress messages like these (translated from duplo's Russian console output):

[+] Grouping files by size...
[+] Found 12 candidate duplicate groups (50 files)
[+] Computing hashes to find duplicates...
Processed: 50/50 files (10.2 files/sec)
[+] Processing finished in 4.9 seconds
[+] Duplicate groups found: 5

Finally, duplo presents the identified duplicate groups (again translated from Russian), showing you which files share the same content:

============================================================
ALL DUPLICATES FOUND
============================================================

Group 1 (Hash: 1a2b3c4d...), Size: 2.50 MB, Category: images
  -> /home/user/my_messy_folder/holiday/sunset.jpg
  -> /home/user/my_messy_folder/backup/sunset_copy.jpg

Group 2 (Hash: e8f9a0b1...), Size: 10.10 MB, Category: videos
  -> /home/user/my_messy_folder/videos/party.mp4
  -> /home/user/my_messy_folder/archive/old_party_video.mp4
  -> /home/user/my_messy_folder/temp/party_renamed.mp4

This output clearly shows you which files are duplicates, making it easy to decide which ones to keep and which to delete.

Under the Hood: How Hashing Works

Let's look at the process in more detail.

Step-by-Step Walkthrough

  1. File List: The File System Scanner provides a list of all files with their paths, sizes, and modification times.
  2. Size Grouping: duplo organizes this list into groups where all files in a group have the exact same size. A file with a unique size cannot have a duplicate, so it is ruled out immediately.
  3. Request Hash: For each file within a size-group (which are now "candidates"), duplo asks the hashing component to calculate its MD5 hash.
  4. Read in Blocks: The hashing component doesn't load the entire file into memory at once (which would consume a huge amount of RAM for large files!). Instead, it reads the file in small chunks (blocks), one by one.
  5. Update Hash: As each chunk is read, it's fed into the MD5 calculation process, which continuously updates the overall hash value.
  6. Final Hash: Once all chunks have been read, the hashing component has the final, unique MD5 hash for the entire file.
  7. Compare and Group: duplo collects these hashes. Any files that share the exact same hash are then grouped together as confirmed duplicates.
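
Steps 4–6 rely on a property of hashlib: feeding the data block by block produces exactly the same hash as processing it in one call. A quick standalone check (independent of duplo's own code):

import hashlib

data = b"x" * 200_000                       # stand-in for a large file's content
one_shot = hashlib.md5(data).hexdigest()    # hash everything at once

incremental = hashlib.md5()
for i in range(0, len(data), 65536):        # 64 KB blocks, the same size duplo uses
    incremental.update(data[i:i + 65536])   # feed one block at a time

assert incremental.hexdigest() == one_shot  # identical fingerprint, tiny memory footprint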

Here's a simplified diagram of this process:

sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant DuplicateFinder as "Duplicate Discovery Engine"
    participant Hasher as "File Hashing Logic"
    participant FileSystem as "File System"

    User->>DuploApp: "Scan folder for duplicates"
    DuploApp->>DuplicateFinder: "Find duplicates (file list from scanner)"
    Note over DuplicateFinder: Groups files by size first

    DuplicateFinder->>Hasher: "Get hash for 'file_A.jpg' (size X)"
    Hasher->>FileSystem: "Read 'file_A.jpg' content (block by block)"
    FileSystem-->>Hasher: Data Block 1
    FileSystem-->>Hasher: Data Block 2
    Note over Hasher: Updates hash value with each block
    FileSystem-->>Hasher: ...last Data Block
    Hasher-->>DuplicateFinder: Returns "Hash_A_Value"

    DuplicateFinder->>Hasher: "Get hash for 'copy_of_file_A.jpg' (size X)"
    Hasher->>FileSystem: "Read 'copy_of_file_A.jpg' content (block by block)"
    FileSystem-->>Hasher: Data Block 1
    FileSystem-->>Hasher: Data Block 2
    Note over Hasher: Updates hash value with each block
    FileSystem-->>Hasher: ...last Data Block
    Hasher-->>DuplicateFinder: Returns "Hash_A_Value"

    Note over DuplicateFinder: Compares "Hash_A_Value" with "Hash_A_Value"
    DuplicateFinder-->>DuploApp: "Found: file_A.jpg and copy_of_file_A.jpg are identical"
    DuploApp-->>User: Displays duplicate report

The Code Behind the Hashing

Let's look at the actual Python code from main.py that handles hashing. The core is the get_file_hash function.

1. The get_file_hash function: This function takes a file's path and calculates its MD5 hash.

# File: main.py
import hashlib
import os

def get_file_hash(filepath, block_size=65536):
    """Вычисляет MD5 хеш файла блоками для экономии памяти."""
    hash_func = hashlib.md5() # 1. Initialize the MD5 'fingerprint kit'
    try:
        with open(filepath, 'rb') as f: # 2. Open the file in 'read binary' mode
            # 3. Read the file in chunks (blocks)
            for block in iter(lambda: f.read(block_size), b''):
                hash_func.update(block) # 4. Update the hash with each block
        return hash_func.hexdigest() # 5. Get the final hash as a string
    except (IOError, OSError):
        return None # Handle cases where file can't be read

Let's break down these lines:

  1. hash_func = hashlib.md5(): This line creates an MD5 "hasher" object. Think of it as preparing a special device to create a fingerprint.
  2. with open(filepath, 'rb') as f:: This safely opens the file. rb means "read binary," which is important because files contain raw bytes, not just text.
  3. for block in iter(lambda: f.read(block_size), b''):: This reads the file piece by piece. The two-argument form of iter keeps calling f.read(block_size) until it returns the sentinel b'' (empty bytes), which signals the end of the file. block_size=65536 means 65,536 bytes (64 KB) are read at a time, so even huge files never hog your computer's memory.
  4. hash_func.update(block): Each piece of data (block) read from the file is fed into the hash_func. The hasher continuously updates its internal calculation to incorporate the new data.
  5. return hash_func.hexdigest(): After all blocks have been processed, this line converts the final calculated hash into a hexadecimal string, which is the short "fingerprint" you see.
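
Using the function is then straightforward; for example (paths are illustrative):

hash_a = get_file_hash("/home/user/my_messy_folder/holiday/sunset.jpg")
hash_b = get_file_hash("/home/user/my_messy_folder/backup/sunset_copy.jpg")

if hash_a is not None and hash_a == hash_b:
    print("These two files have identical content")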

2. Integration into find_duplicates_parallel: The get_file_hash function is called within the find_duplicates_parallel function, which is responsible for orchestrating the duplicate discovery.

# File: main.py (simplified snippet)
from collections import defaultdict

# ... (other imports and functions) ...

def find_duplicates_parallel(file_list, max_workers=8, cache_file=None):
    # ... (initial setup and grouping by size) ...

    # Group files by size first (optimization)
    size_groups = defaultdict(list)
    for file_path, file_size, mtime in file_list:
        size_groups[file_size].append((file_path, mtime))
    
    # Filter for groups with more than one file (potential duplicates)
    candidate_groups = {size: paths for size, paths in size_groups.items() if len(paths) > 1}
    
    hashes_map = defaultdict(list) # This will store hashes -> list of file paths
    processed = 0
    
    # Iterate through potential duplicate files
    for size, files_in_group in candidate_groups.items():
        for file_path, mtime in files_in_group:
            # ... (code to check hash cache, which we'll cover in the next chapter) ...
            
            # If hash is not in cache, calculate it!
            file_hash = get_file_hash(file_path) # <<< Here's the magic!

            if file_hash is not None:
                hashes_map[file_hash].append(file_path) # Add file to its hash group
            
            processed += 1
            # ... (progress reporting) ...

    # After processing all files, 'hashes_map' contains groups of identical files
    duplicates = {h: paths for h, paths in hashes_map.items() if len(paths) > 1}
    return duplicates

In this simplified view of find_duplicates_parallel:

  • Files are first organized into size_groups to quickly filter out non-duplicates.
  • Then, for each file in the candidate_groups (those with the same size), get_file_hash(file_path) is called to generate its unique fingerprint.
  • The hashes_map dictionary is crucial here: it stores each calculated file_hash as a key, and a list of all file_paths that share that hash as its value.
  • Finally, duplo goes through hashes_map and identifies duplicates as any hash key that has more than one file path associated with it. These are your true duplicates!
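
Conceptually, hashes_map ends up with this shape (hash keys abbreviated and paths illustrative):

hashes_map = {
    "1a2b3c4d...": [                # two paths share one hash -> a duplicate group
        "/home/user/my_messy_folder/holiday/sunset.jpg",
        "/home/user/my_messy_folder/backup/sunset_copy.jpg",
    ],
    "9f8e7d6c...": [                # a single path -> unique file, filtered out
        "/home/user/my_messy_folder/docs/report.pdf",
    ],
}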

Conclusion

The File Hashing & Comparison system is the brain behind duplo's ability to find truly identical files. By generating unique "fingerprints" (MD5 hashes) for each file's content and then quickly comparing these fingerprints, duplo can identify duplicates with high accuracy and efficiency. This ensures that when you decide to clean up your files, you're deleting actual redundant copies, not just similarly named but different files.

However, calculating hashes for thousands or millions of files can still take a lot of time. What if duplo could remember the hashes it already calculated? That's precisely what the Hash Cache does, which we'll explore in the next chapter: Hash Cache.


Generated by AI Codebase Knowledge Builder.
