01_file_system_scanner_.md - it255ru/duplo GitHub Wiki

Chapter 1: File System Scanner

Imagine your computer's hard drive as a huge library. Over time, you've added many books (files) to different shelves (directories). Some books are big, some are small, and some you haven't touched in ages. If you wanted to find all the duplicate books or just get an idea of what kinds of books you have, where would you start? You'd probably need to make an inventory first!

This is exactly what the File System Scanner in duplo does. It's like a super-efficient librarian who meticulously goes through every shelf and lists out every book.

What Problem Does It Solve?

Before we can find duplicate files, categorize them, or decide what to delete, duplo needs to know what files even exist in the first place. The File System Scanner handles this crucial first step.

Use Case: You want to clean up a messy folder full of pictures, videos, and documents to find out what's taking up space and identify potential duplicates.

To solve this, duplo needs to:

  1. Go through every corner of the folder.
  2. Find every single file.
  3. Note down some basic details about each file.

Let's see how duplo starts this process.

How duplo Uses the Scanner

When you run duplo, you tell it which main directory (folder) to explore. For example, if you want to scan a folder named my_messy_folder:

python main.py my_messy_folder

This simple command kicks off the File System Scanner.

What Happens? The scanner begins its work and prints a message like the one below. duplo's console messages are in Russian; this one means "Starting directory scan":

[+] Начинаем сканирование директории: my_messy_folder

After it finishes, the scanner prints a summary of its findings, which is followed by more detailed information. The labels translate to "Summary statistics", "Total number of files", and "Total data volume":

============================================================
СВОДНАя СТАТИСТИКА
============================================================
Общее количество файлов: 1234
Общий объем данных: 1.50 GB

This "inventory" is a list of all files found, along with their path, size, and when they were last changed. It's the foundation for everything else duplo does!
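The scanner collects sizes in bytes, while the summary shows a rounded figure such as 1.50 GB. Below is a minimal sketch of how such a conversion could work, assuming binary units (1 KB = 1024 bytes). The name `format_size` is a hypothetical helper for illustration; the source does not show how duplo formats sizes.

```python
def format_size(num_bytes):
    """Convert a byte count to a human-readable string (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024  # move up to the next unit
    return f"{size:.2f} TB"


# 1.5 * 1024**3 bytes
print(format_size(1610612736))  # prints "1.50 GB"
```

Keeping raw byte counts in the inventory and formatting only at display time avoids rounding errors when sizes are summed.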

What Information Does the Scanner Collect?

The File System Scanner focuses on gathering only essential details about each file. Think of our librarian analogy again: the librarian notes the title, author, and publication date of each book, but they don't read the entire book's content at this stage.

For each file, the scanner notes:

  • Full Path: The exact location of the file on your computer (e.g., /home/user/my_messy_folder/pictures/holiday.jpg).
  • Size: How much space the file takes up (e.g., 2.5 MB).
  • Last Modified Time: When the file was last changed or saved.

Important: The scanner does not read the actual content inside your files. This keeps the scanning process fast and efficient.
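To see why this is cheap, here is a small sketch showing that size and modification time come from file system metadata, not from the file's content. It uses `os.stat` from the standard library on a throwaway file; the file name and the variable names are illustrative only.

```python
import os
import tempfile
from datetime import datetime

# Create a throwaway 2048-byte file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
    tmp.write(b"x" * 2048)
    sample_path = tmp.name

# os.stat asks the file system for metadata; the file is not opened or read here.
info = os.stat(sample_path)
record = (sample_path, info.st_size, info.st_mtime)

print("Path:    ", record[0])
print("Size:    ", record[1], "bytes")
print("Modified:", datetime.fromtimestamp(record[2]).isoformat(timespec="seconds"))

os.remove(sample_path)  # clean up the throwaway file
```

The three values in `record` correspond to the full path, size, and last modified time listed above.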

Under the Hood: How the Scanner Works

Let's peek behind the curtain to understand how our diligent librarian (the File System Scanner) performs its task.

Step-by-Step Walkthrough

  1. Start at the Top: duplo receives the main directory you want to scan (e.g., my_messy_folder).
  2. Look Around: It opens this main directory and looks at everything inside: files and other sub-directories.
  3. Process Files: For every file it finds in the current directory, it quickly jots down its path, size, and last modified date.
  4. Dive Deeper: If it finds a sub-directory, it "steps into" that sub-directory and repeats steps 2 and 3. This continues until there are no more sub-directories to explore.
  5. Build the Inventory: As it goes, it keeps adding all this collected file information to one big list.
  6. Report Back: Once every file in every sub-directory has been listed, it provides the total count and size of all files found.
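The steps above can be sketched as an explicit recursion. This is an illustration only: it uses `os.scandir` to make each step visible, it ignores symbolic links, and `scan_recursive` is a hypothetical name, not a function from duplo.

```python
import os
import tempfile


def scan_recursive(directory, inventory=None):
    """Illustrative recursive scan; returns a list of (path, size, mtime)."""
    if inventory is None:
        inventory = []
    try:
        with os.scandir(directory) as it:  # Step 2: look around
            entries = list(it)
    except OSError:
        return inventory  # unreadable directory: skip it
    for entry in entries:
        try:
            if entry.is_dir(follow_symlinks=False):
                scan_recursive(entry.path, inventory)  # Step 4: dive deeper
            elif entry.is_file(follow_symlinks=False):
                info = entry.stat(follow_symlinks=False)  # Step 3: note details
                # Step 5: add the record to the inventory
                inventory.append((entry.path, info.st_size, info.st_mtime))
        except OSError:
            continue  # entry vanished or is inaccessible
    return inventory


# Small demo on a temporary tree: top/a.txt and top/sub/b.txt, 5 bytes each.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(top, rel), "w") as fh:
            fh.write("hello")
    found = scan_recursive(top)
    # Step 6: report the count and total size
    print(len(found), sum(size for _, size, _ in found))  # prints "2 10"
```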

Here's a simple diagram illustrating this process:

sequenceDiagram
    participant User
    participant DuploApp as "Duplo Application"
    participant Scanner as "File System Scanner"
    participant FileSystem as "File System"

    User->>DuploApp: "Scan folder 'my_messy_folder'"
    DuploApp->>Scanner: "Start scan in 'my_messy_folder'"
    Note over Scanner: Initiates inventory collection

    Scanner->>FileSystem: "List items in 'my_messy_folder'"
    FileSystem-->>Scanner: Returns: File A, Subfolder X

    Scanner->>FileSystem: "Get info for File A"
    FileSystem-->>Scanner: Returns: Path, Size, Last Modified
    Scanner->>Scanner: Adds File A info to inventory

    Scanner->>FileSystem: "List items in 'Subfolder X'"
    FileSystem-->>Scanner: Returns: File B, File C

    Scanner->>FileSystem: "Get info for File B"
    FileSystem-->>Scanner: Returns: Path, Size, Last Modified
    Scanner->>Scanner: Adds File B info to inventory

    Scanner->>FileSystem: "Get info for File C"
    FileSystem-->>Scanner: Returns: Path, Size, Last Modified
    Scanner->>Scanner: Adds File C info to inventory

    Note over Scanner: All files processed
    Scanner-->>DuploApp: Returns complete file inventory and stats
    DuploApp-->>User: Displays scan results

The Code Behind the Scanner

Let's look at the actual Python code from main.py that implements this File System Scanner. The main part is a function called scan_directory.

1. The scan_directory function: This function takes the starting directory as input and will return two things: a list of all files with their details, and some overall statistics.

def scan_directory(directory):
    """Recursively scans a directory and collects statistics."""
    all_files = [] # This list will store our file inventory
    stats = {
        'total_files': 0,
        'total_size': 0,
        # ... other statistics we'll ignore for now ...
    }
    # ... more code below ...
    return all_files, stats

Here, all_files is the empty list that will eventually hold all the basic information about every file found. The stats dictionary is used to collect summary numbers like the total number of files and their combined size, which is useful for the initial report.

2. Walking through the directories: To explore every folder and subfolder, Python has a handy tool called os.walk.

    # ... inside scan_directory(directory):
    for root, dirs, files in os.walk(directory):
        # 'root' is the current folder path (e.g., /home/user/pictures)
        # 'dirs' is a list of subfolders in 'root'
        # 'files' is a list of files in 'root'
        for file in files:
            full_path = os.path.join(root, file)
            # ... get file info ...

The os.walk(directory) function is like our librarian moving through the library. It visits the starting folder and every subfolder in turn, and for each one it yields three values: root (the path of the current folder), dirs (the names of the folders inside root), and files (the names of the files directly inside root). The inner loop only needs files, because os.walk descends into the folders listed in dirs on its own.
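To make the three values concrete, here is a small self-contained demo that builds a temporary tree and prints what `os.walk` yields for each folder. The folder and file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Build a tiny tree: top/a.txt and top/photos/b.jpg
    os.makedirs(os.path.join(top, "photos"))
    open(os.path.join(top, "a.txt"), "w").close()
    open(os.path.join(top, "photos", "b.jpg"), "w").close()

    visited = []
    for root, dirs, files in os.walk(top):
        rel = os.path.relpath(root, top)  # show paths relative to the top folder
        visited.append((rel, sorted(dirs), sorted(files)))
        print(rel, dirs, files)

# Output:
# . ['photos'] ['a.txt']
# photos [] ['b.jpg']
```

Note that dirs and files hold bare names, which is why the scanner joins each name with root to get a full path.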

3. Gathering File Information: For each file found, we need its full path, size, and last modified time.

            # ... inside the 'for file in files:' loop:
            full_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(full_path)
                mtime = os.path.getmtime(full_path)
            except OSError:
                continue # Skip files we can't access
            
            # ... update statistics ...

            all_files.append((full_path, file_size, mtime))

  • os.path.join(root, file) combines the current folder path (root) with the filename (file) to get the complete path to the file.
  • os.path.getsize(full_path) gets the file's size in bytes.
  • os.path.getmtime(full_path) gets the modification time as a floating-point number of seconds since the epoch (a Unix timestamp).
  • The try...except OSError block lets duplo skip files whose metadata it cannot read, for example because of a permission error, a broken symbolic link, or a file that was deleted mid-scan.
  • Finally, all_files.append((full_path, file_size, mtime)) adds a (path, size, modification time) tuple for the file to the all_files list, building the complete inventory.

The stats dictionary is also updated in this loop (the update code is not shown here) to keep track of totals, which are then presented in the "СВОДНАя СТАТИСТИКА" (Summary Statistics) section of the output. These totals give you an overview of the scanned directory.
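Putting the fragments together gives the following runnable sketch. The two lines that update stats are an assumption, since the original elides that step, and the dictionary holds only the two keys shown earlier; the real function tracks additional statistics.

```python
import os
import tempfile


def scan_directory(directory):
    """Recursively scans a directory and collects statistics."""
    all_files = []  # inventory of (full_path, file_size, mtime) tuples
    stats = {"total_files": 0, "total_size": 0}

    for root, dirs, files in os.walk(directory):
        for file in files:
            full_path = os.path.join(root, file)
            try:
                file_size = os.path.getsize(full_path)
                mtime = os.path.getmtime(full_path)
            except OSError:
                continue  # skip files we can't access

            # Assumed update logic (not shown in the original):
            stats["total_files"] += 1
            stats["total_size"] += file_size

            all_files.append((full_path, file_size, mtime))

    return all_files, stats


# Demo on a temporary folder containing two small files.
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "sub"))
    with open(os.path.join(demo_dir, "x.bin"), "wb") as fh:
        fh.write(b"abc")
    with open(os.path.join(demo_dir, "sub", "y.bin"), "wb") as fh:
        fh.write(b"12345")
    demo_files, demo_stats = scan_directory(demo_dir)
    print(demo_stats)  # prints {'total_files': 2, 'total_size': 8}
```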

Conclusion

You've just learned about the File System Scanner, the first and most fundamental part of duplo. It acts as the project's diligent librarian, meticulously cataloging every file's basic information (path, size, last modified time) without looking inside. This initial inventory is essential for all subsequent operations, much like a good foundation is crucial for any building.

Now that we have a list of all files and their basic properties, the next logical step is to understand what kind of files they are. This leads us to our next chapter: File Categorizer.


Generated by AI Codebase Knowledge Builder.
