02_file_categorizer_.md - it255ru/duplo GitHub Wiki

Chapter 2: File Categorizer

In the previous chapter, File System Scanner, we learned how duplo acts like a super-efficient librarian, creating a detailed inventory of every file in your chosen directory. It listed out the path, size, and last modified time for each "book" (file). But imagine if that librarian just gave you a giant list of every book in the library, without telling you if it's a novel, a cookbook, a textbook, or a photo album. It would still be pretty hard to find what you're looking for, wouldn't it?

This is where the File Categorizer comes into play. It takes that raw list of files and adds another crucial piece of information: their type or "genre."

What Problem Does It Solve?

After the scanner lists all your files, duplo needs to understand what kind of files they are. Is it an image? A video? A document? Knowing this makes the entire cleanup process much smarter and more manageable.

Use Case: You've just scanned your my_messy_folder and received a summary of 1234 files taking up 1.50 GB. You want to quickly see how much of that space is taken by pictures versus videos, and later, you'd like to find duplicate images specifically, not just any duplicate file.

To solve this, duplo needs to:

  1. Look at each file's identity (its extension).
  2. Assign it to a meaningful group (category).
  3. Provide a summary of these categories.

What is the File Categorizer?

Think of the File Categorizer as an automated sorting machine or a specialized librarian who knows exactly how to categorize every book. It assigns each file to a specific "department" or "genre" like "Images," "Videos," "Documents," or "Archives."

How does it know what category a file belongs to? It primarily uses the file's extension. The extension is the part of the filename after the last dot (e.g., .jpg for an image, .mp4 for a video, .docx for a document). duplo has a predefined list of extensions for each category.
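As a quick illustration of how extension extraction behaves, the sketch below uses Python's standard os.path.splitext, the same call that appears in scan_directory later in this chapter. The file names are made up for the example.

```python
import os

# Hypothetical file names, chosen to show the edge cases.
samples = ["sunset.jpg", "PHOTO.JPG", "backup.tar.gz", "README", ".bashrc"]

# splitext returns (root, ext); ext keeps the leading dot and is '' when absent.
extensions = {name: os.path.splitext(name)[1].lower() for name in samples}

for name, ext in extensions.items():
    print(f"{name!r:16} -> {ext!r}")
```

Only the last dot counts, so backup.tar.gz yields .gz, while README and .bashrc yield an empty string and would therefore end up in the 'other' category.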

How duplo Uses the Categorizer

The File Categorizer works automatically during the scanning phase, right after the File System Scanner finds a file. As it builds its inventory, duplo simultaneously categorizes each file.

When you run duplo to scan a directory, for example:

python main.py my_messy_folder

After the initial file system scan, duplo presents a "РАСПРЕДЕЛЕНИЕ ПО КАТЕГОРИЯМ" (DISTRIBUTION BY CATEGORIES) section in its report. It shows which types of files were found, how many of each (the word "файлов" in each row means "files"), and how much space they occupy:

============================================================
РАСПРЕДЕЛЕНИЕ ПО КАТЕГОРИЯМ
============================================================
* IMAGES      :   500 файлов (40.5%),     800.00 MB
* VIDEOS      :    50 файлов (4.1%),      500.00 MB
* DOCUMENTS   :   300 файлов (24.3%),      100.00 MB
* ARCHIVES    :    10 файлов (0.8%),        50.00 MB
* OTHER       :   374 файлов (30.3%),        50.00 MB

This summary shows at a glance that images are the largest category, both by file count (40.5%) and by size (800 MB of the 1.50 GB total). That is useful for understanding your data and, later, for filtering duplicate reports by category.
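To make the arithmetic behind such a summary concrete, here is a minimal sketch that turns per-category counts and sizes into report rows. The function name category_report is hypothetical, the 1 MB = 1024 * 1024 bytes conversion is an assumption, and the labels are in English; duplo's own reporting code is not shown in this chapter and may differ.

```python
MB = 1024 * 1024  # assumed conversion factor

def category_report(by_category):
    """Return (NAME, count, percent_of_files, size_mb) rows in input order."""
    total_files = sum(entry["count"] for entry in by_category.values())
    rows = []
    for name, entry in by_category.items():
        # Guard against an empty scan to avoid dividing by zero.
        percent = 100.0 * entry["count"] / total_files if total_files else 0.0
        rows.append((name.upper(), entry["count"],
                     round(percent, 1), round(entry["size"] / MB, 2)))
    return rows

# Counts and sizes taken from the sample report.
sample = {
    "images":    {"count": 500, "size": 800 * MB},
    "videos":    {"count": 50,  "size": 500 * MB},
    "documents": {"count": 300, "size": 100 * MB},
    "archives":  {"count": 10,  "size": 50 * MB},
    "other":     {"count": 374, "size": 50 * MB},
}

rows = category_report(sample)
for name, count, percent, size_mb in rows:
    print(f"* {name:<12}: {count:>5} files ({percent}%), {size_mb:>10.2f} MB")
```

The percentages here are shares of the file count, not of the total size, which is why images account for 40.5% of files while holding more than half of the bytes.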

Under the Hood: How the Categorizer Works

Let's peek behind the curtain to understand how our sorting librarian (the File Categorizer) works its magic.

Step-by-Step Walkthrough

  1. Scanner Finds a File: The File System Scanner locates a file, let's say /home/user/my_messy_folder/holiday/sunset.jpg.
  2. Extract Extension: The categorizer looks at the file's full path and extracts its extension: .jpg.
  3. Lookup Category: It then checks this extension against its internal list of known categories.
  4. Assign Category: If .jpg is found in the "images" list, the file is assigned the "images" category.
  5. Update Inventory & Stats: This category information is added to the file's record, and duplo updates its internal statistics for the "images" category (increasing its file count and size).
  6. Repeat: This process happens for every single file found during the scan.
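The steps above can be sketched without touching the disk. In this hedged example the (path, size) pairs stand in for files the scanner has found, CATEGORIES is a trimmed-down stand-in for duplo's extension lists, and categorize is a hypothetical helper name.

```python
import os

# Trimmed-down stand-in for duplo's extension lists (illustration only).
CATEGORIES = {
    "images": {".jpg", ".png"},
    "videos": {".mp4"},
}

def categorize(path):
    """Steps 2-4: extract the extension, look it up, fall back to 'other'."""
    ext = os.path.splitext(path)[1].lower()
    for category, extensions in CATEGORIES.items():
        if ext in extensions:
            return category
    return "other"

# Step 1: pretend the scanner found these files (sizes in bytes are made up).
found = [
    ("/home/user/my_messy_folder/holiday/sunset.jpg", 2_000_000),
    ("/home/user/my_messy_folder/video.mp4", 50_000_000),
    ("/home/user/my_messy_folder/notes.xyz", 1_000),
]

# Steps 5 and 6: update per-category totals for every file.
stats = {}
for path, size in found:
    entry = stats.setdefault(categorize(path), {"count": 0, "size": 0})
    entry["count"] += 1
    entry["size"] += size

print(stats)
```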

Here's a simple diagram illustrating this process:

sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant Scanner as "File System Scanner"
    participant Categorizer as "File Categorizer"

    User->>DuploApp: "Scan folder 'my_messy_folder'"
    DuploApp->>Scanner: "Start scan"

    Scanner->>Scanner: (finds a file: sunset.jpg)
    Scanner->>Categorizer: "What category is '.jpg'?"
    Categorizer->>Categorizer: (checks internal lists)
    Categorizer-->>Scanner: "Returns 'images'"
    Scanner->>Scanner: Add 'images' to file info and update category stats

    Scanner->>Scanner: (finds another file: video.mp4)
    Scanner->>Categorizer: "What category is '.mp4'?"
    Categorizer->>Categorizer: (checks internal lists)
    Categorizer-->>Scanner: "Returns 'videos'"
    Scanner->>Scanner: Add 'videos' to file info and update category stats

    Note over Scanner: All files processed and categorized
    Scanner-->>DuploApp: Returns complete inventory and category stats
    DuploApp-->>User: Displays scan results including category breakdown

The Code Behind the Categorizer

Let's look at the actual Python code from main.py that implements the File Categorizer.

1. The FILE_CATEGORIES Dictionary: duplo uses a dictionary named FILE_CATEGORIES to map file extensions to their respective categories.

# Dictionary for classifying files by extension
FILE_CATEGORIES = {
    'images': {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp'}, # ... many more ...
    'videos': {'.mp4', '.avi', '.mov', '.wmv', '.flv', '.mkv', '.webm'}, # ... many more ...
    'audio': {'.mp3', '.wav', '.flac', '.aac', '.ogg', '.wma'}, # ... many more ...
    'documents': {'.pdf', '.doc', '.docx', '.txt', '.rtf', '.xls', '.xlsx'}, # ... many more ...
    'archives': {'.zip', '.rar', '.7z', '.tar', '.gz', '.bz2', '.xz'}, # ... many more ...
    'other': set() # For extensions not in any other category
}

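This dictionary is like our librarian's reference book, listing which extensions belong to which genre. Each category holds a set of extensions, so checking whether an extension belongs to a category is a fast membership test. The 'other' entry is an empty set: no extension is listed under it, and files land there only because the lookup function falls back to 'other' when nothing else matches.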
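Because the lookup shown in the next step returns the first category that contains an extension, an extension listed under two categories would silently resolve to whichever category comes first. A small sanity check can catch that. This is a sketch under assumptions: find_overlaps is a hypothetical helper that is not part of duplo, and the dictionary is the abbreviated version from this chapter.

```python
from itertools import combinations

# Abbreviated copy of the mapping from this chapter.
FILE_CATEGORIES = {
    'images': {'.jpg', '.jpeg', '.png', '.gif', '.bmp', '.tiff', '.webp'},
    'videos': {'.mp4', '.avi', '.mov', '.wmv', '.flv', '.mkv', '.webm'},
    'audio': {'.mp3', '.wav', '.flac', '.aac', '.ogg', '.wma'},
    'documents': {'.pdf', '.doc', '.docx', '.txt', '.rtf', '.xls', '.xlsx'},
    'archives': {'.zip', '.rar', '.7z', '.tar', '.gz', '.bz2', '.xz'},
    'other': set(),
}

def find_overlaps(categories):
    """Return {(cat_a, cat_b): shared_extensions} for every overlapping pair."""
    overlaps = {}
    for (name_a, exts_a), (name_b, exts_b) in combinations(categories.items(), 2):
        shared = exts_a & exts_b  # set intersection
        if shared:
            overlaps[(name_a, name_b)] = shared
    return overlaps

overlaps = find_overlaps(FILE_CATEGORIES)
print(overlaps or "no extension is listed twice")
```

Running a check like this after editing the extension lists keeps the category totals unambiguous.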

2. The get_file_category Function: This small function takes a file extension (like .jpg) and finds its category using the FILE_CATEGORIES dictionary.

def get_file_category(extension):
    """Determine a file's category from its extension."""
    for category, extensions in FILE_CATEGORIES.items():
        if extension.lower() in extensions:
            return category
    return 'other' # If no match, it's an 'other' file

The function loops through each category in FILE_CATEGORIES. If the given extension (converted to lowercase for consistency) is found within a category's set of extensions, that category's name is returned. If no match is found after checking all categories, it defaults to 'other'.
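As an alternative to looping over every category, the mapping can be inverted once into an extension-to-category dictionary, so each lookup becomes a single dictionary access. This is not how duplo does it; it is a sketch of a common variation, and EXTENSION_TO_CATEGORY and get_file_category_fast are hypothetical names.

```python
# Abbreviated copy of the mapping from this chapter.
FILE_CATEGORIES = {
    'images': {'.jpg', '.jpeg', '.png', '.gif'},
    'videos': {'.mp4', '.avi', '.mov', '.mkv'},
    'documents': {'.pdf', '.doc', '.docx', '.txt'},
    'other': set(),
}

# Invert once: extension -> category.
EXTENSION_TO_CATEGORY = {
    ext: category
    for category, extensions in FILE_CATEGORIES.items()
    for ext in extensions
}

def get_file_category_fast(extension):
    """Look up the category with one dict access, defaulting to 'other'."""
    return EXTENSION_TO_CATEGORY.get(extension.lower(), 'other')

print(get_file_category_fast('.JPG'))
print(get_file_category_fast('.xyz'))
```

With only a handful of categories the loop is already cheap, so this is mainly a readability choice. The two approaches agree as long as no extension appears in more than one category.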

3. Integration into scan_directory: The scan_directory function (which you saw in Chapter 1: File System Scanner) now incorporates the categorizer.

def scan_directory(directory):
    # ... (initial setup for all_files and stats) ...

    for root, dirs, files in os.walk(directory):
        for file in files:
            full_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(full_path)
                mtime = os.path.getmtime(full_path)
            except OSError:
                continue
            
            # ... (update total_files and total_size) ...

            ext = os.path.splitext(file)[1].lower() # Get the file extension
            
            # Use the categorizer to find the file's type
            category = get_file_category(ext) 
            
            # Update statistics for this category
            stats['by_category'][category]['count'] += 1
            stats['by_category'][category]['size'] += file_size

            # ... (update stats for by_extension and by_directory) ...

            all_files.append((full_path, file_size, mtime))

    return all_files, stats

Inside the main loop of scan_directory, after getting basic file info:

  • os.path.splitext(file)[1].lower() is used to get the file's extension (e.g., ".jpg") and convert it to lowercase.
  • get_file_category(ext) is called to determine the category for this file.
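  • Finally, stats['by_category'][category]['count'] += 1 and stats['by_category'][category]['size'] += file_size keep a running total of the number of files and their combined size for each category. These totals are what the "РАСПРЕДЕЛЕНИЕ ПО КАТЕГОРИЯМ" summary in the report is built from. Note that in this excerpt the category feeds only the statistics: the tuple appended to all_files holds just the path, size, and modification time.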
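The excerpt elides how stats is set up, so the shape of stats['by_category'] is not shown. One way to make the += updates work for any category is a defaultdict that creates a zeroed entry on first access. The sketch below is an assumption, not duplo's actual setup: new_stats, category_of, and scan are hypothetical names, and the run uses a throwaway temporary directory.

```python
import os
import tempfile
from collections import defaultdict

def new_stats():
    """Hypothetical initial shape: zeroed counters created on first access."""
    return {"by_category": defaultdict(lambda: {"count": 0, "size": 0})}

def category_of(ext):
    # Minimal stand-in for get_file_category.
    return {".jpg": "images", ".mp4": "videos"}.get(ext, "other")

def scan(directory):
    stats = new_stats()
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full_path = os.path.join(root, name)
            try:
                size = os.path.getsize(full_path)
            except OSError:
                continue  # unreadable file: skip it, as the excerpt does
            category = category_of(os.path.splitext(name)[1].lower())
            stats["by_category"][category]["count"] += 1
            stats["by_category"][category]["size"] += size
    return stats

# Try it on a temporary directory containing three small files.
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.jpg", b"12345"), ("b.JPG", b"123"), ("c.txt", b"1")]:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(payload)
    result = scan(tmp)

print(dict(result["by_category"]))
```

A plain dictionary pre-filled with one entry per key of FILE_CATEGORIES would work equally well and has the side benefit of listing categories with zero files.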

Conclusion

The File Categorizer is a vital component of duplo, taking the raw inventory from the scanner and giving it structure and meaning. By classifying files into intuitive categories like "Images" or "Videos," duplo transforms a generic list into an organized, understandable overview. This categorization makes the later stages of finding and managing duplicates much more effective and user-friendly.

Now that every file has a category, the next step is to look inside the files and determine whether their content is identical, not just their names or sizes. That task is the subject of the next chapter: File Hashing & Comparison.


Generated by AI Codebase Knowledge Builder. References: [1]
