03_file_hashing___comparison_.md - it255ru/duplo GitHub Wiki
In the previous chapter, File Categorizer, we saw how duplo organizes your files into neat categories like "Images" or "Videos" based on their extensions. This gives us a great overview of our digital library. But what if you have two pictures, vacation_pic.jpg and holiday_photo_copy.jpg? They have different names, and might even be in different folders, but are they the exact same image inside? Just looking at their names, sizes, or categories isn't enough to tell.
This is where the File Hashing & Comparison system comes in. It's like having an advanced fingerprinting expert for your digital files.
duplo needs a reliable way to confirm if two files are truly identical in their content, regardless of their name, location, or even when they were last modified.
Use Case: You've scanned your my_messy_folder and found many files. You want to identify every single instance of the exact same photo or exact same document so you can safely delete the redundant copies.
To solve this, duplo needs to:
- Generate a unique "fingerprint" for the content of each file.
- Compare these fingerprints to find perfect matches.
Let's dive into how duplo achieves this with "hashing."
Imagine you have a very detailed document. A "hash" is a special mathematical process that takes all the content of that document and boils it down into a short, unique code, much like a human fingerprint. This short code is called a hash value or checksum. duplo uses a popular hashing method called MD5.
Here's why hashing is so powerful for finding duplicates:
- Unique Fingerprint: Even the tiniest change to a file (like adding a single space) will result in a completely different hash value. This means if two files have identical content, they must have the same hash.
- Fixed Size: No matter if your file is 1 KB or 10 GB, its MD5 hash will always be a short string of characters (e.g., `d41d8cd98f00b204e9800998ecf8427e`).
- Fast Comparison: Instead of reading and comparing every byte of two potentially huge files, duplo just needs to compare their short hash codes. This is incredibly fast!
Think of it like this:
| File Content | MD5 Hash (Fingerprint) |
|---|---|
| "Hello World!" | ed076287532e86365e841e92b396b575 |
| "Hello World! " (note the extra space) | f6f6990a4247565c82ff573397984852 |
| "Hello World!" (same as first row) | ed076287532e86365e841e92b396b575 |
| image_a.jpg (a specific photo) | 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d |
| copy_of_image_a.jpg (an exact copy) | 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d |
As you can see, identical content always produces identical hash values. Different content produces different hash values.
The File Hashing & Comparison process is the core mechanism duplo uses to accurately identify duplicates.
1. Initial Scan & Grouping by Size: After the File System Scanner creates a list of all files, duplo first groups them by their exact size. This is a smart optimization: files with different sizes can never be identical, so we don't need to waste time generating hashes for them.
2. Hashing Candidates: duplo then focuses only on groups where there are two or more files with the same size. For each file in these "candidate" groups, it calculates the MD5 hash.
3. Comparing Hashes: Once all hashes are calculated, duplo compares them. If multiple files (which already had the same size) also have the same MD5 hash, duplo confidently declares them to be identical duplicates.
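The three steps above can be sketched in a few lines of Python. This is a minimal, sequential illustration only — `find_identical` is a name made up here, and duplo's real implementation adds parallelism, progress reporting, and caching:

```python
import hashlib
import os
import tempfile
from collections import defaultdict

def find_identical(files):
    """files: iterable of (path, size) pairs -> {hash: [paths]} for duplicates."""
    # Step 1: group files by their exact size
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)

    # Step 2: hash only "candidate" groups (two or more files of the same size)
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            with open(path, "rb") as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    # Step 3: same size AND same hash -> confirmed duplicates
    return {h: p for h, p in by_hash.items() if len(p) > 1}

# Demo: three same-sized files, two with identical content
folder = tempfile.mkdtemp()
files = []
for name, data in [("a.txt", b"same"), ("b.txt", b"same"), ("c.txt", b"diff")]:
    path = os.path.join(folder, name)
    with open(path, "wb") as f:
        f.write(data)
    files.append((path, len(data)))

groups = find_identical(files)
print(list(groups.values()))  # one group: the paths of a.txt and b.txt
```

Note that `a.txt`, `b.txt`, and `c.txt` all have the same size, so all three become hashing candidates — but only the first two share a hash.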
When you run duplo, you'll see messages indicating this process:
```
[+] Grouping files by size...
[+] Found 12 candidate duplicate groups (50 files)
[+] Computing hashes to find duplicates...
Processed: 50/50 files (10.2 files/sec)
[+] Processing completed in 4.9 seconds
[+] Duplicate groups found: 5
```
Finally, duplo presents the identified duplicate groups, showing you which files share the same content:
```
============================================================
ALL DUPLICATES FOUND
============================================================
Group 1 (Hash: 1a2b3c4d...), Size: 2.50 MB, Category: images
  -> /home/user/my_messy_folder/holiday/sunset.jpg
  -> /home/user/my_messy_folder/backup/sunset_copy.jpg
Group 2 (Hash: e8f9a0b1...), Size: 10.10 MB, Category: videos
  -> /home/user/my_messy_folder/videos/party.mp4
  -> /home/user/my_messy_folder/archive/old_party_video.mp4
  -> /home/user/my_messy_folder/temp/party_renamed.mp4
```
This output clearly shows you which files are duplicates, making it easy to decide which ones to keep and which to delete.
Let's look at the process in more detail.
- File List: The File System Scanner provides a list of all files with their paths, sizes, and modification times.
- Size Grouping: duplo organizes this list into groups where all files in a group have the exact same size. Files of unique sizes are immediately discarded as potential duplicates.
- Request Hash: For each file within a size group (these are now "candidates"), duplo asks the hashing component to calculate its MD5 hash.
- Read in Blocks: The hashing component doesn't load the entire file into memory at once (which would be very slow and use a lot of RAM for large files!). Instead, it reads the file in small chunks (blocks), one by one.
- Update Hash: As each chunk is read, it's fed into the MD5 calculation process, which continuously updates the overall hash value.
- Final Hash: Once all chunks have been read, the hashing component has the final, unique MD5 hash for the entire file.
- Compare and Group: duplo collects these hashes. Any files that share the exact same hash are then grouped together as confirmed duplicates.
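The "Read in Blocks / Update Hash / Final Hash" steps rely on a guarantee of `hashlib`: feeding data to a hasher incrementally produces exactly the same digest as hashing everything in one shot. A quick standalone check:

```python
import hashlib

data = b"x" * 200_000          # stand-in for a large file's contents

# One-shot hash of the whole content
whole = hashlib.md5(data).hexdigest()

# Incremental hash, 64 KB at a time (as duplo's hasher does)
h = hashlib.md5()
for i in range(0, len(data), 65536):
    h.update(data[i:i + 65536])
chunked = h.hexdigest()

print(whole == chunked)  # True
```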
Here's a simplified diagram of this process:
```mermaid
sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant DuplicateFinder as "Duplicate Discovery Engine"
    participant Hasher as "File Hashing Logic"
    participant FileSystem as "File System"
    User->>DuploApp: "Scan folder for duplicates"
    DuploApp->>DuplicateFinder: "Find duplicates (file list from scanner)"
    Note over DuplicateFinder: Groups files by size first
    DuplicateFinder->>Hasher: "Get hash for 'file_A.jpg' (size X)"
    Hasher->>FileSystem: "Read 'file_A.jpg' content (block by block)"
    FileSystem-->>Hasher: Data Block 1
    FileSystem-->>Hasher: Data Block 2
    Note over Hasher: Updates hash value with each block
    FileSystem-->>Hasher: ...last Data Block
    Hasher-->>DuplicateFinder: Returns "Hash_A_Value"
    DuplicateFinder->>Hasher: "Get hash for 'copy_of_file_A.jpg' (size X)"
    Hasher->>FileSystem: "Read 'copy_of_file_A.jpg' content (block by block)"
    FileSystem-->>Hasher: Data Block 1
    FileSystem-->>Hasher: Data Block 2
    Note over Hasher: Updates hash value with each block
    FileSystem-->>Hasher: ...last Data Block
    Hasher-->>DuplicateFinder: Returns "Hash_A_Value"
    Note over DuplicateFinder: Compares "Hash_A_Value" with "Hash_A_Value"
    DuplicateFinder-->>DuploApp: "Found: file_A.jpg and copy_of_file_A.jpg are identical"
    DuploApp-->>User: Displays duplicate report
```
Let's look at the actual Python code from main.py that handles hashing. The core is the get_file_hash function.
1. The get_file_hash function:
This function takes a file's path and calculates its MD5 hash.
```python
# File: main.py
import hashlib
import os

def get_file_hash(filepath, block_size=65536):
    """Computes a file's MD5 hash in blocks to save memory."""
    hash_func = hashlib.md5()  # 1. Initialize the MD5 'fingerprint kit'
    try:
        with open(filepath, 'rb') as f:  # 2. Open the file in 'read binary' mode
            # 3. Read the file in chunks (blocks)
            for block in iter(lambda: f.read(block_size), b''):
                hash_func.update(block)  # 4. Update the hash with each block
        return hash_func.hexdigest()  # 5. Get the final hash as a string
    except (IOError, OSError):
        return None  # Handle cases where the file can't be read
```

Let's break down these lines:
- `hash_func = hashlib.md5()`: This line creates an MD5 "hasher" object. Think of it as preparing a special device to create a fingerprint.
- `with open(filepath, 'rb') as f:`: This safely opens the file. `rb` means "read binary," which is important because files contain raw bytes, not just text.
- `for block in iter(lambda: f.read(block_size), b''):`: This is a clever way to read the file piece by piece. `block_size=65536` means it reads 65,536 bytes (64 KB) at a time. This prevents large files from hogging all your computer's memory.
- `hash_func.update(block)`: Each piece of data (block) read from the file is fed into the `hash_func`. The hasher continuously updates its internal calculation to incorporate the new data.
- `return hash_func.hexdigest()`: After all blocks have been processed, this line converts the final calculated hash into a hexadecimal string, which is the short "fingerprint" you see.
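To see the function in action, here is a small standalone sketch (the function body is repeated so the snippet runs on its own, and the file names are invented for illustration):

```python
import hashlib
import os
import tempfile

def get_file_hash(filepath, block_size=65536):
    """Computes a file's MD5 hash in blocks to save memory."""
    hash_func = hashlib.md5()
    try:
        with open(filepath, 'rb') as f:
            for block in iter(lambda: f.read(block_size), b''):
                hash_func.update(block)
        return hash_func.hexdigest()
    except (IOError, OSError):
        return None

# Two files with identical bytes but different names...
folder = tempfile.mkdtemp()
original = os.path.join(folder, "vacation_pic.jpg")
copy = os.path.join(folder, "holiday_photo_copy.jpg")
data = b"fake image bytes " * 10_000  # larger than one 64 KB block
for path in (original, copy):
    with open(path, "wb") as f:
        f.write(data)

# ...produce exactly the same fingerprint
print(get_file_hash(original) == get_file_hash(copy))   # True
# An unreadable or missing file returns None instead of crashing
print(get_file_hash(os.path.join(folder, "missing.jpg")))  # None
```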
2. Integration into find_duplicates_parallel:
The get_file_hash function is called within the find_duplicates_parallel function, which is responsible for orchestrating the duplicate discovery.
```python
# File: main.py (simplified snippet)
from collections import defaultdict
# ... (other imports and functions) ...

def find_duplicates_parallel(file_list, max_workers=8, cache_file=None):
    # ... (initial setup) ...

    # Group files by size first (optimization)
    size_groups = defaultdict(list)
    for file_path, file_size, mtime in file_list:
        size_groups[file_size].append((file_path, mtime))

    # Filter for groups with more than one file (potential duplicates)
    candidate_groups = {size: paths for size, paths in size_groups.items() if len(paths) > 1}

    hashes_map = defaultdict(list)  # This will store hashes -> list of file paths
    processed = 0

    # Iterate through potential duplicate files
    for size, files_in_group in candidate_groups.items():
        for file_path, mtime in files_in_group:
            # ... (code to check hash cache, which we'll cover in the next chapter) ...
            # If hash is not in cache, calculate it!
            file_hash = get_file_hash(file_path)  # <<< Here's the magic!
            if file_hash is not None:
                hashes_map[file_hash].append(file_path)  # Add file to its hash group
            processed += 1
            # ... (progress reporting) ...

    # After processing all files, 'hashes_map' contains groups of identical files
    duplicates = {h: paths for h, paths in hashes_map.items() if len(paths) > 1}
    return duplicates
```

In this simplified view of `find_duplicates_parallel`:
- Files are first organized into `size_groups` to quickly filter out non-duplicates.
- Then, for each file in the `candidate_groups` (those with the same size), `get_file_hash(file_path)` is called to generate its unique fingerprint.
- The `hashes_map` dictionary is crucial here: it stores each calculated `file_hash` as a key, and a list of all `file_path`s that share that hash as its value.
- Finally, duplo goes through `hashes_map` and identifies `duplicates` as any hash key that has more than one file path associated with it. These are your true duplicates!
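The role of `hashes_map` in that last step can be shown with a toy example (the hash values and paths below are invented purely for illustration):

```python
from collections import defaultdict

# hash value -> list of file paths that produced it
hashes_map = defaultdict(list)
hashes_map["1a2b3c4d"].append("/photos/sunset.jpg")
hashes_map["1a2b3c4d"].append("/backup/sunset_copy.jpg")  # same content, same key
hashes_map["9f8e7d6c"].append("/docs/report.pdf")          # unique file, no twin

# Keep only hash keys shared by more than one path -> true duplicates
duplicates = {h: paths for h, paths in hashes_map.items() if len(paths) > 1}
print(duplicates)
# {'1a2b3c4d': ['/photos/sunset.jpg', '/backup/sunset_copy.jpg']}
```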
The File Hashing & Comparison system is the brain behind duplo's ability to find truly identical files. By generating unique "fingerprints" (MD5 hashes) for each file's content and then quickly comparing these fingerprints, duplo can identify duplicates with high accuracy and efficiency. This ensures that when you decide to clean up your files, you're deleting actual redundant copies, not just similarly named but different files.
However, calculating hashes for thousands or millions of files can still take a lot of time. What if duplo could remember the hashes it already calculated? That's precisely what the Hash Cache does, which we'll explore in the next chapter: Hash Cache.
Generated by AI Codebase Knowledge Builder.