06_identical_directory_finder_.md - it255ru/duplo GitHub Wiki

Chapter 6: Identical Directory Finder

In the previous chapter, Duplicate Discovery Engine, we learned how duplo is a super-sleuth at finding individual duplicate files scattered across your computer. It uses clever techniques like size grouping and unique content "fingerprints" (hashes) to tell you precisely which files are exact copies. This is incredibly powerful for cleaning up individual redundant files.

But what if you have an entire folder that is an exact copy of another folder, perhaps with a different name or in a different location? Maybe you made several backups of your "Holiday 2023" photos, and now you have C:\Photos\Holiday 2023, D:\Backups\Holiday 2023 - Copy, and E:\Archive\Old Holidays\Holiday 2023. While the file names and sizes inside these folders might be the same, duplo needs a way to confirm if the entire collection of files within them is truly identical.

This is where the Identical Directory Finder comes into play. It takes duplo's file-level detective work and applies it to whole folders, helping you spot and manage entire duplicate directories.

What Problem Does It Solve?

Finding duplicate files is great, but managing many individual files can still be daunting. The Identical Directory Finder helps duplo identify entire folders that contain the exact same set of files, regardless of the folder's name or location. This is crucial for decluttering large backups or archives where you might have multiple redundant copies of entire projects or photo albums.

Use Case: You've run duplo on your main drive and found many individual duplicate files. Now you want to know if any of your old "backup" folders are completely redundant, meaning they contain the exact same content as another folder, so you can delete the entire redundant backup folder in one go.

To solve this, duplo needs to:

  1. Understand the unique "composition" of each directory based on its files.
  2. Compare these compositions to find perfectly matching directories.

What is the Identical Directory Finder?

Think of the Identical Directory Finder as a specialized librarian who doesn't just check if two individual books are the same, but rather if two entire bookshelves contain the exact same collection of books. The order of books on the shelf doesn't matter to this librarian, only that the same books are present.

Here's the core idea:

  • Directory "Playlist" or "Signature": For each directory, duplo creates a unique "playlist" or "signature": a sorted list of the MD5 hashes (fingerprints) of all the files contained directly within that directory.
  • Hash Consistency: Since each file's hash is unique to its content (as we learned in File Hashing & Comparison), a directory's signature effectively represents the unique content blueprint of that entire folder.
  • Order Doesn't Matter (for files): The actual order in which files appear on your disk or are listed doesn't affect their content. So, to ensure a consistent signature, duplo sorts the list of hashes for each directory. This way, if Folder A has files file1.txt (hash A), file2.txt (hash B) and Folder B has file2.txt (hash B), file1.txt (hash A), they will both generate the same sorted hash list [hash A, hash B], and thus the same signature.
  • Comparison: If two directories have the exact same sorted "playlist" of file fingerprints, they are considered identical, even if their names or parent directories are different.
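The core idea can be sketched in a few lines of Python. This is an illustrative sketch, not duplo's actual code: the hash strings are made up and stand in for real MD5 fingerprints.

```python
# Made-up hash strings stand in for real MD5 fingerprints.
folder_a = ["hashA", "hashB"]  # file1.txt, file2.txt
folder_b = ["hashB", "hashA"]  # file2.txt, file1.txt (different listing order)
folder_c = ["hashA", "hashC"]  # different content

def signature(hashes):
    # Sorting makes the signature independent of listing order;
    # converting to a tuple makes it immutable and hashable.
    return tuple(sorted(hashes))

print(signature(folder_a) == signature(folder_b))  # True: identical folders
print(signature(folder_a) == signature(folder_c))  # False: different folders
```

Because `folder_a` and `folder_b` contain the same hashes, their sorted signatures match even though the files were listed in a different order.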

How duplo Uses the Identical Directory Finder

To activate the Identical Directory Finder, you need to use the --find-identical-dirs argument when running duplo. This tells duplo to perform this additional check after it has found all the individual duplicate files.

python main.py my_messy_folder --find-identical-dirs

After scanning, categorizing, and finding individual duplicate files, duplo proceeds to find identical directories. You'll see status messages for this stage (duplo prints in Russian: «Поиск идентичных каталогов...» means "Searching for identical directories...", and «Найдено групп идентичных каталогов» means "Found groups of identical directories"):

[+] Начинаем сканирование директории: my_messy_folder
... (output from previous chapters) ...
[+] Найдено групп дубликатов: 5

[+] Поиск идентичных каталогов...
[+] Найдено групп идентичных каталогов: 2

Then, if identical directories are found, duplo will present them in a clear report (the Russian headers translate to "INTERACTIVE DUPLICATE MANAGEMENT MODE" and "PROCESSING IDENTICAL DIRECTORIES"; «Группа идентичных каталогов» means "identical directory group", and «файлов» means "files"):

============================================================
ИНТЕРАКТИВНЫЙ РЕЖИМ УПРАВЛЕНИЯ ДУБЛИКАТАМИ
============================================================

... (interactive file deletion for individual files) ...

[+] ОБРАБОТКА ИДЕНТИЧНЫХ КАТАЛОГОВ

Группа идентичных каталогов #1:
  [1] /home/user/my_messy_folder/Photos_Backup (100 файлов, 500.00 MB)
  [2] /home/user/archive/OldPhotos (100 файлов, 500.00 MB)

Группа идентичных каталогов #2:
  [1] /home/user/docs/Project_X_V1 (50 файлов, 20.00 MB)
  [2] /home/user/work/Project_X_Old (50 файлов, 20.00 MB)
  [3] /home/user/temp/Project_X_Copy (50 файлов, 20.00 MB)

This output shows you that entire folders (Photos_Backup and OldPhotos in Group #1) are identical, making it easy to decide to remove one of them, saving a lot of space and simplifying your file structure.

Under the Hood: How the Finder Works

Let's peek behind the curtain to understand how duplo creates these directory "playlists" and identifies identical folders.

Step-by-Step Walkthrough

  1. Collect File Hashes: The Duplicate Discovery Engine has already calculated MD5 hashes for many (if not all) of your files, especially those that might be duplicates. The Identical Directory Finder uses these existing hashes.
  2. Map Files to Directories: For every file duplo has processed, it identifies the parent directory it belongs to.
  3. Build Directory Hash Lists: For each directory found during the scan, duplo creates a list of all the hashes of the files that reside directly within that directory.
  4. Create Directory Signatures: To make the comparison robust (so that file order doesn't matter), duplo takes this list of hashes for each directory and sorts it. This sorted list (often converted to an immutable tuple in Python) becomes the unique "signature" of that directory.
  5. Group by Signature: duplo then groups all directories that share the exact same signature.
  6. Identify Identical Directories: Any group that contains more than one directory is a group of truly identical directories. These are reported to the user.
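The six steps can be traced end-to-end on a toy example. The paths and hash values below are invented for illustration and do not come from duplo itself:

```python
import os
from collections import defaultdict

# Steps 1-2: pretend the engine already gave us a file -> hash mapping.
file_hashes = {
    "/backup/Photos/a.jpg": "hashA",
    "/backup/Photos/b.jpg": "hashB",
    "/archive/Old/b.jpg":   "hashB",
    "/archive/Old/a.jpg":   "hashA",
    "/docs/Report/r.txt":   "hashC",
}

# Step 3: collect the list of hashes per parent directory.
dir_hashes = defaultdict(list)
for path, h in file_hashes.items():
    dir_hashes[os.path.dirname(path)].append(h)

# Step 4: a sorted tuple is the order-independent signature.
dir_signatures = {d: tuple(sorted(hs)) for d, hs in dir_hashes.items()}

# Steps 5-6: group by signature; groups of size > 1 are identical dirs.
groups = defaultdict(list)
for d, sig in dir_signatures.items():
    groups[sig].append(d)
identical = [sorted(ds) for ds in groups.values() if len(ds) > 1]
print(identical)  # [['/archive/Old', '/backup/Photos']]
```

`/backup/Photos` and `/archive/Old` end up in the same group because their sorted hash tuples are equal, while `/docs/Report` stands alone.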

Here's a simple diagram illustrating this process:

sequenceDiagram
    participant DuploApp as "Duplo Application"
    participant Engine as "Duplicate Discovery Engine"
    participant DirFinder as "Identical Directory Finder"

    DuploApp->>Engine: "Scan and find file duplicates"
    Engine-->>DuploApp: Returns {hashA: [file1, file2], hashB: [file3]}
    Note over DuploApp: Also has full list of all files and their hashes

    DuploApp->>DirFinder: "Find identical directories (with file info)"
    Note over DirFinder: Uses file_path -> hash mapping

    DirFinder->>DirFinder: Creates 'dir_hashes':
    DirFinder->>DirFinder: - Folder A: [hashA, hashB]
    DirFinder->>DirFinder: - Folder B: [hashB, hashA]
    DirFinder->>DirFinder: - Folder C: [hashC]

    DirFinder->>DirFinder: Creates 'dir_signatures' by sorting hashes:
    DirFinder->>DirFinder: - Folder A: (hashA, hashB)
    DirFinder->>DirFinder: - Folder B: (hashA, hashB)
    DirFinder->>DirFinder: - Folder C: (hashC)

    DirFinder->>DirFinder: Groups by signature:
    DirFinder->>DirFinder: - (hashA, hashB): [Folder A, Folder B]
    DirFinder->>DirFinder: - (hashC): [Folder C]

    DirFinder-->>DuploApp: Returns: [[Folder A, Folder B]]
    DuploApp-->>DuploApp: Displays identical directory groups

The Code Behind the Identical Directory Finder

Let's look at the actual Python code from main.py that implements the Identical Directory Finder. The core is the find_identical_directories function.

1. Initial Setup and Mapping File Hashes to Directories: This part prepares to gather hashes for each directory.

# File: main.py
import os
from collections import defaultdict

# ... (other functions) ...

def find_identical_directories(statistics, duplicates):
    """
    Находит каталоги, которые содержат одинаковые наборы файлов.
    """
    print("\n[+] Поиск идентичных каталогов...")
    
    # Create a reverse index: for each directory, collect the hashes of
    # the files directly inside it.
    # Note: for tutorial simplicity we build this index only from the
    # 'duplicates' dictionary (hash -> list of file paths); a more robust
    # implementation would record a hash for every scanned file.
    dir_hashes = defaultdict(list)
    
    for file_hash, file_paths in duplicates.items():
        for file_path in file_paths:
            dir_path = os.path.dirname(file_path)  # Get parent directory path
            dir_hashes[dir_path].append(file_hash) # Add hash to that directory's list
    
    # ... more code below ...
  • dir_hashes = defaultdict(list): This is a special dictionary where each key will be a directory path (e.g., /home/user/photos), and its value will be a list of all file hashes found directly within that directory.
  • The for file_hash, file_paths in duplicates.items(): loop iterates over every confirmed duplicate file. For each file, os.path.dirname(file_path) extracts its parent directory, and the file_hash is appended to that directory's list in dir_hashes.
  • Note on simplification: the description above said a directory's signature should be built from the hashes of all its files, but this snippet only sees hashes from the duplicates dictionary. A fully accurate implementation would have scan_directory or find_duplicates_parallel record a hash for every file and pass that mapping along; this tutorial sticks with the duplicates dictionary to keep the code short.
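For example, given a hypothetical duplicates dictionary with one hash shared by files in two different folders (paths and hash invented for illustration), the loop produces:

```python
import os
from collections import defaultdict

# Hypothetical input: one hash shared by files in two different folders.
duplicates = {
    "d41d8cd9": ["/home/user/Photos_Backup/img.jpg",
                 "/home/user/archive/OldPhotos/img.jpg"],
}

dir_hashes = defaultdict(list)
for file_hash, file_paths in duplicates.items():
    for file_path in file_paths:
        dir_hashes[os.path.dirname(file_path)].append(file_hash)

print(dict(dir_hashes))
# {'/home/user/Photos_Backup': ['d41d8cd9'], '/home/user/archive/OldPhotos': ['d41d8cd9']}
```

Each directory now owns the list of hashes of the duplicate files found directly inside it.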

2. Creating Directory Signatures: Once we have lists of hashes for each directory, we sort them to create the unique signature.

# File: main.py (inside find_identical_directories function)
    # ... (previous code for dir_hashes) ...

    dir_signatures = {} # This will store directory path -> sorted hash tuple
    
    # Create signatures for each directory
    for dir_path, hashes in dir_hashes.items():
        # Sort hashes to create a consistent signature, regardless of file order
        # Convert to tuple to make it immutable and hashable for dictionary keys
        dir_signatures[dir_path] = tuple(sorted(hashes))
    
    # ... more code below ...
  • dir_signatures = {}: This new dictionary will map a dir_path to its unique "signature."
  • tuple(sorted(hashes)): For each directory's list of hashes, sorted(hashes) arranges them alphabetically/lexicographically. Then, tuple(...) converts this sorted list into an immutable tuple. Using a tuple is important because tuples can be used as keys in dictionaries, unlike lists. This sorted tuple is the directory's unique signature.
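A quick check (with made-up hash strings) shows why the tuple conversion matters:

```python
hashes = ["hashB", "hashA"]
sig = tuple(sorted(hashes))
print(sig)  # ('hashA', 'hashB')

groups = {sig: ["Folder A"]}  # a tuple works as a dictionary key

try:
    bad = {sorted(hashes): ["Folder A"]}  # a list does not
except TypeError:
    print("lists are unhashable and cannot be dictionary keys")
```

Since lists are mutable, Python refuses them as dictionary keys with a TypeError; the immutable tuple sidesteps this.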

3. Grouping Identical Directories: Finally, we group directories that share the same signature.

# File: main.py (inside find_identical_directories function)
    # ... (previous code for dir_signatures) ...

    signature_groups = defaultdict(list) # Stores signature -> list of directory paths
    for dir_path, signature in dir_signatures.items():
        signature_groups[signature].append(dir_path) # Group directories by their signature
    
    # Keep only groups with more than one directory (these are our identical groups)
    identical_dirs = [dirs for dirs in signature_groups.values() if len(dirs) > 1]
    
    return identical_dirs
  • signature_groups = defaultdict(list): This dictionary will group directories. The key will be the signature (the sorted tuple of hashes), and the value will be a list of all dir_paths that have that exact signature.
  • The loop populates signature_groups.
  • identical_dirs = [dirs for dirs in signature_groups.values() if len(dirs) > 1]: This is the final filter. It takes all the lists of directories from signature_groups and only keeps those lists (dirs) that contain more than one directory. These are the confirmed groups of identical directories.
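With hypothetical signature data (invented paths and hashes), the final filter behaves like this:

```python
# Two directories share one signature; a third stands alone.
signature_groups = {
    ("hashA", "hashB"): ["/backup/Photos", "/archive/Old"],
    ("hashC",): ["/docs/Report"],
}

# Keep only groups containing more than one directory.
identical_dirs = [dirs for dirs in signature_groups.values() if len(dirs) > 1]
print(identical_dirs)  # [['/backup/Photos', '/archive/Old']]
```

The singleton group for `/docs/Report` is dropped, leaving only genuinely redundant directory groups for the user to review.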

Conclusion

The Identical Directory Finder significantly extends duplo's power, moving beyond just finding individual duplicate files to identifying entire duplicate folders. By creating unique "signatures" based on the sorted content hashes of their files, duplo can accurately pinpoint redundant directory structures, even if they have different names or locations. This provides a higher level of insight for organizing and cleaning up your digital storage, allowing you to reclaim substantial disk space by removing entire redundant backups or archives.

Now that duplo can identify both individual duplicate files and entire duplicate directories, the next logical step is to provide you with a way to easily decide which copies to keep and which to delete. This crucial user interaction is handled by the Interactive Deletion Manager, which we'll explore in the next chapter: Interactive Deletion Manager.


Generated by AI Codebase Knowledge Builder. References: [1]
