# Chapter 5: Duplicate Discovery Engine
In the previous chapter, Hash Cache, we perfected duplo's memory by teaching it to remember file "fingerprints" (hashes) so it wouldn't have to recalculate them constantly. This was a huge speed boost for finding identical files. We now have an efficient way to get a unique content identifier for any file.
But how do all these pieces—scanning, categorizing, hashing, and caching—come together to actually tell us, "Hey, these 5 files are exactly the same!"? This is where the Duplicate Discovery Engine steps in. It's the central brain that orchestrates the entire duplicate detection process.
duplo needs a robust, systematic approach to take a mountain of files and precisely pinpoint every single group of truly identical copies. It needs to combine all the smart techniques we've discussed into one "duplicate-finding machine."
Use Case: You've just copied a folder of photos from your phone to your computer, but you suspect you already have many of these photos scattered across various backup folders. You want duplo to give you a definitive list of every duplicate photo, video, and document so you can clear out the redundant copies and free up space.
To solve this, the Duplicate Discovery Engine needs to:
- Efficiently narrow down the search to only potential duplicates.
- Use unique file fingerprints to confirm identical content.
- Present a clear, organized report of all duplicate groups found.
Think of the Duplicate Discovery Engine as the chief detective or the master conductor of duplo. It doesn't just scan files or calculate hashes; it designs the strategy for finding duplicates and then executes it, making sure all the other components work together harmoniously.
Its strategy is built on two clever steps:
- Initial Sorting by Size: First, it gathers all the files collected by the File System Scanner and sorts them into "piles" based on their exact file size. Why? Because a file that's 10 MB can never be a duplicate of a file that's 12 MB! This immediately throws out many files as non-duplicates without any complex calculations.
- Fingerprinting (Hashing) for Same-Sized Piles: Only for the "piles" where two or more files share the exact same size does the engine then engage the File Hashing & Comparison system (and its Hash Cache helper!). It calculates the unique MD5 fingerprint for each file in these piles. If multiple files in a pile end up with the exact same fingerprint, then the engine confidently declares them to be identical duplicates.
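Concretely, this two-step strategy can be sketched in a few lines of Python. This is a deliberately simplified illustration, not duplo's actual code: the real `find_duplicates_parallel` in `main.py` (shown later in this chapter) adds caching, chunked reading, and progress reporting.

```python
import hashlib
from collections import defaultdict

def find_duplicates_simplified(file_list):
    """Illustrative two-step duplicate finder: size first, hashes second."""
    # Step 1: pile files by exact size. Files of different sizes
    # can never be duplicates, so unique sizes are discarded for free.
    by_size = defaultdict(list)
    for path, size in file_list:  # (path, size) pairs for simplicity
        by_size[size].append(path)

    # Step 2: fingerprint only the piles holding two or more files.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size means no possible duplicate
        for path in paths:
            with open(path, "rb") as f:
                # Whole-file read for brevity; duplo hashes in chunks.
                by_hash[hashlib.md5(f.read()).hexdigest()].append(path)

    # Files sharing a fingerprint are confirmed identical copies.
    return {h: p for h, p in by_hash.items() if len(p) > 1}
```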
When you ask duplo to scan a directory, for example:
```bash
python main.py my_messy_folder
```

After the initial scanning and categorization, the Duplicate Discovery Engine takes over. You'll see messages that track its progress:

```text
[+] Группировка файлов по размеру...
[+] Найдено 12 групп кандидатов в дубликаты (50 файлов)
[+] Вычисление хешей для поиска дубликатов...
Обработано: 50/50 файлов (90.5 файл/сек)
[+] Использовано хешей из кэша: 45
[+] Обработка завершена за 0.5 секунд
[+] Найдено групп дубликатов: 5
```
These messages show the engine first grouping by size ("Группировка файлов по размеру"), then finding "candidates" ("групп кандидатов в дубликаты") that need hashing. It then reports on the hashing process, including how many hashes were retrieved from the cache (thanks to Hash Cache!), and finally, how many unique duplicate groups it found.
The engine then presents its findings in a clear report, showing you which files are identical:
```text
============================================================
ВСЕ НАЙДЕННЫЕ ДУБЛИКАТЫ
============================================================

Группа 1 (Хеш: 1a2b3c4d...), Размер: 2.50 MB, Категория: images
  -> /home/user/my_messy_folder/holiday/sunset.jpg
  -> /home/user/my_messy_folder/backup/sunset_copy.jpg

Группа 2 (Хеш: e8f9a0b1...), Размер: 10.10 MB, Категория: videos
  -> /home/user/my_messy_folder/videos/party.mp4
  -> /home/user/my_messy_folder/archive/old_party_video.mp4
  -> /home/user/my_messy_folder/temp/party_renamed.mp4
```
Each "Группа" (Group) represents a set of files that the Duplicate Discovery Engine has confirmed are exact copies of each other.
Let's peek behind the curtain to understand how this "chief detective" organizes its duplicate hunt.
- Initial File List: The process begins with the complete list of all files found by the File System Scanner, including each file's path, size, and last modified time.
- Pile by Size: The engine immediately groups these files by size. Any file with a unique size is set aside (it can't have duplicates). Only "piles" containing two or more files of the exact same size are considered further. These are called "candidate groups."
- Process Candidate Files: For each file within these candidate groups:
  - It first asks the Hash Cache: "Do you already have a valid fingerprint for this file, given its path, size, and last modified time?" (see the sketch after this list)
  - Cache Hit: If the cache says "Yes, here it is!" (meaning the file hasn't changed), the engine uses the cached fingerprint.
  - Cache Miss: If the cache says "No, I don't have it or it's stale," the engine calls the File Hashing & Comparison component to read the file's content and calculate a brand-new fingerprint (an MD5 hash). This new fingerprint is then added to the Hash Cache for future use.
- Group by Fingerprint: As each file's fingerprint is obtained (either from the cache or by calculation), the engine groups files by these fingerprints.
- Identify True Duplicates: Once all candidate files have their fingerprints, the engine examines these fingerprint groups. Any fingerprint associated with more than one file means those files are identical duplicates.
- Report Findings: Finally, duplo collects these groups of true duplicates and presents them in a user-friendly format.
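For reference, the cache conversation in the steps above assumes an interface along the lines of this minimal in-memory sketch. It mirrors the `get`/`set` calls used later in `main.py`, but the real `HashCache` from Chapter 4 also persists entries to disk and may differ in detail.

```python
class HashCache:
    """Simplified in-memory stand-in for the Chapter 4 Hash Cache."""

    def __init__(self):
        self._entries = {}  # path -> (size, mtime, hash)

    def get(self, path, size, mtime):
        """Return the cached hash only if size and mtime still match,
        i.e. the file has not changed since the hash was recorded."""
        entry = self._entries.get(path)
        if entry and entry[0] == size and entry[1] == mtime:
            return entry[2]  # cache hit
        return None  # cache miss: unknown file or stale fingerprint

    def set(self, path, size, mtime, file_hash):
        """Record a freshly calculated hash with its validity metadata."""
        self._entries[path] = (size, mtime, file_hash)
```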
Here's a simple diagram illustrating this process:
```mermaid
sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant Engine as "Duplicate Discovery Engine"
    participant Hasher as "File Hashing & Comparison"
    participant Cache as "Hash Cache"
    participant FileSystem as "File System"

    User->>DuploApp: "Scan folder for duplicates"
    DuploApp->>Engine: "Find duplicates (from scanned files)"
    Note over Engine: Groups files by size
    Engine->>Engine: Identifies candidate files (same size)
    Engine->>Cache: "Get hash for file_A (path, size, mtime)"
    Cache-->>Engine: Returns "None" (miss)
    Engine->>Hasher: "Calculate hash for file_A"
    Hasher->>FileSystem: Read file_A content
    FileSystem-->>Hasher: Returns data blocks
    Hasher-->>Engine: Returns "Hash_A_Value"
    Engine->>Cache: "Store file_A (path, size, mtime, Hash_A_Value)"
    Cache-->>Engine: Hash stored
    Engine->>Engine: Groups file_A by "Hash_A_Value"
    Engine->>Cache: "Get hash for file_B (path, size, mtime)"
    Cache-->>Engine: Returns "Hash_A_Value" (hit)
    Note over Engine: file_B is copy of file_A
    Engine->>Engine: Groups file_B by "Hash_A_Value"
    Note over Engine: All candidates processed
    Engine->>Engine: Identifies groups with multiple files (e.g., Hash_A_Value has file_A, file_B)
    Engine-->>DuploApp: Returns duplicate groups
    DuploApp-->>User: Displays duplicate report
```
The heart of the Duplicate Discovery Engine is the `find_duplicates_parallel` function in `main.py`. This function coordinates all the steps we just described.
1. Grouping by Size: The first crucial step is to group all files by their size. Files of different sizes cannot be duplicates, so this is a powerful first filter.
```python
# File: main.py
from collections import defaultdict

def find_duplicates_parallel(file_list, max_workers=8, cache_file=None):
    # ... (cache initialization omitted for brevity) ...
    print("\n[+] Группировка файлов по размеру...")

    size_groups = defaultdict(list)
    for file_path, file_size, mtime in file_list:
        size_groups[file_size].append((file_path, mtime))

    # Filter for groups with potential duplicates (2 or more files of same size)
    candidate_groups = {size: paths for size, paths in size_groups.items() if len(paths) > 1}
    print(f"[+] Найдено {len(candidate_groups)} групп кандидатов в дубликаты")
    # ... (more code below) ...
```

- `size_groups = defaultdict(list)` creates a special dictionary where each file size becomes a key, and its value is a list of all files with that size.
- The `for` loop populates `size_groups`.
- `candidate_groups` then filters this, keeping only those sizes (keys) whose list contains more than one file (`len(paths) > 1`). These are the only files that need further (and more intensive) hashing.
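To see the filter in action, here is what the grouping produces for a small made-up file list (the paths, sizes, and mtimes below are invented for illustration):

```python
from collections import defaultdict

file_list = [
    ("/photos/a.jpg", 2500000, 1700000000.0),
    ("/backup/a_copy.jpg", 2500000, 1700000100.0),
    ("/docs/report.pdf", 81000, 1690000000.0),  # unique size
]

size_groups = defaultdict(list)
for file_path, file_size, mtime in file_list:
    size_groups[file_size].append((file_path, mtime))

candidate_groups = {size: paths for size, paths in size_groups.items() if len(paths) > 1}
print(candidate_groups)
# {2500000: [('/photos/a.jpg', 1700000000.0), ('/backup/a_copy.jpg', 1700000100.0)]}
# report.pdf is gone: its size is unique, so it cannot have a duplicate
# and never needs to be hashed.
```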
2. Hashing and Cache Integration:
For each file in these `candidate_groups`, duplo then proceeds to get its hash, using the Hash Cache to speed things up.
```python
# File: main.py (inside find_duplicates_parallel function)
    # ... (previous code for size grouping) ...
    hashes_map = defaultdict(list)  # Stores hash -> list of file paths
    # ... (progress reporting setup omitted) ...

    for size, files_in_group in candidate_groups.items():
        for file_path, mtime in files_in_group:
            file_size = size
            file_hash = None

            # Try to get hash from cache (Chapter 4)
            if cache:  # 'cache' object from HashCache class
                file_hash = cache.get(file_path, file_size, mtime)
                if file_hash:
                    # If found in cache, use it and move to next file
                    hashes_map[file_hash].append(file_path)
                    # ... (update cache hit counter) ...
                    continue  # Skip hash calculation for this file

            # If not in cache, or cache invalid, calculate the hash (Chapter 3)
            if file_hash is None:
                file_hash = get_file_hash(file_path)  # Calls function from Chapter 3
                if file_hash is not None and cache:
                    # Store newly calculated hash in cache (Chapter 4)
                    cache.set(file_path, file_size, mtime, file_hash)

            if file_hash is not None:
                hashes_map[file_hash].append(file_path)
            # ... (update processed counter and print progress) ...

    # ... (save cache and print summary omitted) ...
```

- `hashes_map = defaultdict(list)` is another special dictionary. Here, the `file_hash` becomes the key, and the value is a list of all file paths that share that exact hash. This is how the engine groups identical files together.
- The loops iterate through `candidate_groups`.
- Inside, it first attempts to get the `file_hash` from the `cache` (the `HashCache` object we saw in Chapter 4).
- If `file_hash` is `None` (a cache miss), it calls `get_file_hash(file_path)` (from Chapter 3) to calculate the hash.
- Once a hash is obtained (either from the cache or calculated), it's added to `hashes_map`.
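The `get_file_hash` function itself belongs to Chapter 3. As a rough reminder of its contract (it returns a hex digest, or `None` for unreadable files, which the code above checks for), a chunked MD5 implementation might look like this; the real function in `main.py` may differ in chunk size and error handling:

```python
import hashlib

def get_file_hash(file_path, chunk_size=65536):
    """Calculate an MD5 fingerprint by reading the file in fixed-size
    chunks, so even huge files never need to fit in memory at once."""
    md5 = hashlib.md5()
    try:
        with open(file_path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                md5.update(chunk)
    except OSError:
        return None  # unreadable file: the engine skips None hashes
    return md5.hexdigest()
```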
3. Identifying Final Duplicates:
After processing all candidate files, `hashes_map` contains lists of all files that share the same fingerprint. The final step is to extract only those groups where more than one file shares the same hash.
```python
# File: main.py (end of find_duplicates_parallel function)
    # ... (code for hashing and cache integration) ...

    # Finally, filter 'hashes_map' to keep only true duplicate groups
    # (where more than one file shares the same hash)
    duplicates = {h: paths for h, paths in hashes_map.items() if len(paths) > 1}
    return duplicates
```

- `duplicates = {h: paths for h, paths in hashes_map.items() if len(paths) > 1}` is a concise way to create a new dictionary. It keeps only those `(hash, list_of_paths)` pairs from `hashes_map` where the `list_of_paths` has more than one item. These are your confirmed duplicate groups!
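Because the returned dictionary maps each shared hash to its list of identical files, producing the final report is straightforward. A hypothetical consumer (simplified from the report format shown earlier, and omitting the category column) could look like this:

```python
import os

def print_duplicate_report(duplicates):
    """Print each confirmed duplicate group with its hash and size."""
    for i, (file_hash, paths) in enumerate(duplicates.items(), start=1):
        size_mb = os.path.getsize(paths[0]) / (1024 * 1024)
        print(f"Группа {i} (Хеш: {file_hash[:8]}...), Размер: {size_mb:.2f} MB")
        for path in paths:
            print(f"  -> {path}")
```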
The Duplicate Discovery Engine is the heart of duplo's mission. It cleverly combines size-based filtering, efficient content hashing (with the help of the Hash Cache), and smart grouping to precisely identify all identical copies of files scattered across your system. It takes all the groundwork laid by the previous components and transforms it into actionable insights: a clear list of every duplicate you possess.
Now that we can find identical files, what about finding entire folders that are exact copies of each other? That's the next detective task for duplo, which we'll explore in the next chapter: Identical Directory Finder.