Understanding Project Directory Structure - inzamamshajahan/github-actions-learn4 GitHub Wiki

A well-organized project structure is fundamental for several reasons:

  • Clarity: Makes it easier for you and others to understand where different parts of the project reside.
  • Maintainability: Simplifies finding and modifying code, tests, and configuration.
  • Scalability: Allows the project to grow more components without becoming chaotic.
  • Tool Compatibility: Many development tools (linters, testers, build systems) expect or work best with standard layouts.
  • Collaboration: Provides a common framework when working in a team.

Let's break it down:


Overall Project Structure Philosophy

This project follows a common and recommended layout for Python applications, particularly those that are meant to be installable packages (even simple ones) and that use development practices such as testing, linting, and CI/CD.

Key characteristics of this structure:

  • Source Code in src/: Placing the main application code inside a src directory (the "src layout") is a widely adopted best practice. It helps prevent accidental imports of your package from the root directory during development, ensuring that your tests and local runs behave more like they would in an installed environment.
  • Tests in tests/: Separating tests into a dedicated tests directory at the root level is standard.
  • Configuration at the Root: Project-wide configuration files (like pyproject.toml, .gitignore, .pre-commit-config.yaml) are kept at the project root for easy access and discovery.
  • Data Separation: Input/output data and logs are kept in a data/ directory, separating them from source code and tests.
  • Virtual Environment: The venv/ directory contains the project's isolated Python environment, crucial for dependency management.

Top-Level Files and Directories

Let's go through each significant item at the root of your my_python_project/ directory:

1. LICENSE

  • Purpose: This file defines the legal terms under which your software is distributed. It tells others what they can and cannot do with your code (e.g., use, modify, distribute).
  • Content: You've chosen the MIT License.
    • Why MIT? It's a permissive open-source license. It allows almost unrestricted use, modification, and distribution (both commercial and private), as long as the original copyright and license notice are included. It's popular for its simplicity and developer-friendliness.
    • Alternatives: Many other licenses exist (GPL, Apache 2.0, BSD, etc.), each with different terms. The choice depends on how you want your software to be used and shared. MIT is a common and good default for many open-source projects.
  • Importance: Crucial for any project you intend to share, especially open-source. It clarifies rights and obligations for users and contributors.

2. README.md

  • Purpose: This is the front page of your project. It's typically the first file someone will read when they encounter your project. It should provide an overview, setup instructions, usage examples, and other relevant information.
  • Content: Yours includes:
    • Project title and a brief description.
    • A "Features" section highlighting key aspects (data transformation, dependency management, CI/CD, etc.).
    • "Local Setup" instructions (cloning, virtual environment, installing dependencies, pre-commit hooks).
    • "Running the Script Locally" instructions.
    • Mention of the input, output, and log files generated (as per the updated instructions).
  • Why Markdown (.md)? Markdown is a lightweight markup language that's easy to write and read, and it renders nicely on platforms like GitHub, GitLab, etc.
  • Alternatives: Plain text files (README.txt) or reStructuredText (README.rst, common in the Python ecosystem, especially alongside Sphinx documentation). Markdown is the more common choice for project READMEs because of its widespread adoption.
  • Importance: Essential for usability and understanding. A good README significantly lowers the barrier to entry for new users or contributors.

3. pyproject.toml

  • Purpose: This is the central configuration file for modern Python projects, as defined by PEP 517 and PEP 518 (and subsequent PEPs like 621, 660). It standardizes how build systems and development tools configure themselves.
  • Content:
    • [build-system] table:
      • Specifies the build backend (e.g., setuptools, poetry, flit, hatch). You're using setuptools.
      • requires: Lists packages needed to build your project (e.g., setuptools>=61.0).
      • build-backend: The Python object that build frontends (like pip) will call to build your package.
      • #backend-path = ["."] (commented out): If your build backend wasn't directly importable, this would tell tools where to find it. Not needed for standard setuptools.
    • [project] table (PEP 621 metadata):
      • Defines project metadata like name, version, description, readme, requires-python, license, authors.
      • dependencies: Lists runtime dependencies (e.g., pandas, numpy). These are what pip install your_project_name would install.
      • [project.optional-dependencies]: Defines optional sets of dependencies.
        • dev: You have a dev group for development tools (pytest, ruff, mypy, etc.). These are installed with pip install -e ".[dev]" (the quotes keep shells such as zsh from interpreting the square brackets).
    • [tool.setuptools.packages.find] table:
      • Tells setuptools where to find your Python packages. where = ["src"] means it should look for packages inside the src/ directory.
    • [tool.ruff] table (and sub-tables lint, format):
      • Configuration for the Ruff linter and formatter.
      • line-length: Max line length (you set it to 200, which is longer than the common 88 or 120, but it's a project choice).
      • select: Specifies which Ruff rule codes to enable (e.g., E for pycodestyle errors, F for Pyflakes, I for isort import sorting, PT for flake8-pytest-style).
      • ignore: List of rules to disable.
      • quote-style, indent-style: Formatting preferences.
    • [tool.mypy] table:
      • Configuration for the Mypy static type checker.
      • python_version, warn_return_any, plugins, mypy_path = "src" (important for Mypy to find your source code correctly, especially with the src layout).
    • [tool.pytest.ini_options] table:
      • Configuration for Pytest.
      • minversion, addopts (command-line options like coverage reporting), testpaths.
    • [tool.bandit] table:
      • Placeholder for Bandit (security linter) configuration, though currently empty. Bandit reads this table only when it is pointed at the file explicitly (for example, bandit -c pyproject.toml -r src).
  • Why pyproject.toml?
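    • Standardization: Replaces or reduces the need for older, fragmented configuration files like setup.py (for building), setup.cfg, and tool-specific files (like .isort.cfg, .flake8). One exception: with setuptools, MANIFEST.in is still the file that controls which extra files go into the source distribution.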
    • Declarative: Metadata is declarative, making it easier for tools to parse.
    • Tool Agnostic (for core metadata): The [project] table is standard, regardless of the build tool.
  • Alternatives:
    • Older projects might still use setup.py for build logic and metadata, often with a setup.cfg for declarative config.
    • Some projects might use separate config files for each tool (e.g., .ruff.toml, .mypy.ini).
  • Chosen because: It's the modern, standard way to configure Python projects and is supported by a growing number of tools. It centralizes configuration, making the project cleaner.

4. requirements.txt

  • Purpose: Traditionally, this file lists the project's runtime dependencies with specific versions (or version ranges) needed to run the application.
  • Content: Yours lists pandas and numpy.
  • Relationship with pyproject.toml:
    • In your current setup, pyproject.toml's [project].dependencies section also lists pandas and numpy. This is where the definitive list of runtime dependencies for your installable package is declared.
    • The requirements.txt file in this project seems to serve a specific purpose for the EC2 deployment script in your GitHub Actions workflow (pip install -r requirements.txt). This is a common pattern for CI/CD or deployment environments where you might want a simple, direct way to install only runtime dependencies without installing the project itself in editable mode or its dev dependencies.
  • Why have both?
    • pyproject.toml (dependencies): Defines the abstract dependencies of your installable package. pip or other installers resolve these.
    • requirements.txt: Can serve multiple purposes:
      1. Concrete dependencies for an application environment: Often generated by pip freeze > requirements.txt to pin exact versions for reproducible environments.
      2. Specific list for deployment: As used in your CI/CD script to install just pandas and numpy on EC2. This avoids installing all the dev tools on the production-like server.
  • Alternatives for deployment dependencies:
    • The deployment script could directly install pandas and numpy without a requirements.txt (e.g., pip install pandas numpy).
    • The deployment script could install the project using pip install . from the cloned repo on EC2 if the pyproject.toml correctly specified only runtime dependencies. This would install my_data_project_src_main and its dependencies.
  • Chosen because (likely): requirements.txt provides a simple, explicit list for the EC2 deployment script. While there's some duplication with pyproject.toml, it serves a clear role in the deployment pipeline.
    • Note: For consistency, it's good practice to ensure that the dependencies in requirements.txt are compatible with (or a subset of) those specified in pyproject.toml. Some tools can help synchronize these or generate requirements.txt from pyproject.toml.
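
The consistency check mentioned in the note can be approximated in a few lines. The sketch below is an assumption about how such a check might look, not something the project ships: it handles only simple "name plus optional version specifier" lines, and the function names are hypothetical.

```python
import re


def normalized_name(requirement: str) -> str:
    """Extract the distribution name from a simple requirement line and
    normalize it (lowercase, runs of '-', '_' and '.' collapsed to '-')."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def unlisted_requirements(requirements_txt: str, project_deps: list[str]) -> set[str]:
    """Return names found in requirements.txt but not declared in
    pyproject.toml's [project].dependencies."""
    declared = {normalized_name(dep) for dep in project_deps}
    listed = set()
    for line in requirements_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        listed.add(normalized_name(line))
    return listed - declared


extra = unlisted_requirements(
    "pandas>=2.0\nnumpy\n# deployment only\nrequests==2.31.0\n",
    ["pandas", "numpy"],
)
print(extra)
```

An empty result means requirements.txt is a subset of the declared runtime dependencies; any name that is printed exists only in requirements.txt and deserves a second look.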

5. .gitignore

  • Purpose: Specifies intentionally untracked files and directories that Git should ignore. This prevents committing temporary files, build artifacts, local configurations, sensitive data, and virtual environments.
  • Content: Your .gitignore is comprehensive and well-structured, covering:
    • Python bytecode (__pycache__, *.py[cod]).
    • Distribution/packaging artifacts (dist/, build/, *.egg-info/, *.whl).
    • Virtual environments (venv/, .venv/, etc.).
    • IDE/editor-specific files (.vscode/, .idea/).
    • OS-generated files (.DS_Store, Thumbs.db).
    • Testing/coverage files (.pytest_cache/, .coverage, htmlcov/).
    • Linter/Type checker caches (.mypy_cache/, .ruff_cache/).
    • Secrets or instance-specific config (instance/, *.local, *.env).
    • Jupyter Notebook checkpoints.
    • Specific data outputs (data/processed_output.csv).
    • Log files (*.log, data/*.log).
    • The documentation/ directory (this is a choice; sometimes docs are versioned, sometimes not if they are auto-generated and hosted elsewhere).
  • Why: Keeps the repository clean, reduces merge conflicts, avoids committing unnecessary or sensitive files, and makes git status more meaningful.
  • Importance: Essential for any Git repository.
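
Patterns such as *.py[cod] are shell-style globs. As a rough illustration, Python's fnmatch module can show which file names a few of the simple patterns above would catch. This is only an approximation: Git's own matching has extra rules (trailing-slash directory patterns, leading-slash anchoring, **, negation with !) that this sketch does not model.

```python
from fnmatch import fnmatchcase

# A few basename patterns of the kind listed above.
IGNORE_PATTERNS = ["*.py[cod]", "*.log", ".DS_Store", "*.whl"]


def is_ignored(filename: str) -> bool:
    """Return True if the bare file name matches any ignore pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in IGNORE_PATTERNS)


for name in ["main.py", "main.pyc", "data_processing.log", ".DS_Store"]:
    print(name, is_ignored(name))
```

The character class [cod] matches exactly one of c, o, or d, which is why main.pyc is caught while main.py is not.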

6. .pre-commit-config.yaml

  • Purpose: Configures pre-commit hooks. Pre-commit is a framework for managing and maintaining multi-language pre-commit hooks. These hooks run checks on your code before you commit it, helping to enforce code style, linting, and other quality standards locally.
  • Content:
    • Defines repos (repositories containing hooks) and specific hooks to use from those repos.
    • pre-commit-hooks: Standard hooks like trailing-whitespace, end-of-file-fixer, check-yaml, check-added-large-files.
    • ruff-pre-commit: Hooks for running ruff check --fix (linting and auto-fixing) and ruff format (formatting). You've updated this to rev: 'v0.11.10'. The hook's tags track Ruff's own release numbers, so this revision pins Ruff 0.11.10. Note that version components compare numerically, so 0.11.10 is newer than, for example, 0.4.4.
    • mirrors-mypy: Hook for running Mypy.
  • Why: Automates code quality checks locally, catching issues before they even reach CI or other developers. Promotes consistent code style across the project.
  • Importance: Highly recommended for maintaining code quality and consistency, especially in team environments.
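
To give a feel for what the simplest hooks do, here is a rough Python approximation of trailing-whitespace and end-of-file-fixer. It is a sketch under simplifying assumptions (Unix line endings, text held in memory); the real hooks handle more cases, and the function names here are hypothetical.

```python
def strip_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


def fix_end_of_file(text: str) -> str:
    """Ensure non-empty text ends with exactly one newline."""
    if not text:
        return text
    return text.rstrip("\n") + "\n"


def run_hooks(text: str) -> tuple[str, bool]:
    """Apply both fixers. The flag mirrors pre-commit's behaviour of
    failing the commit whenever a hook had to modify a file."""
    fixed = fix_end_of_file(strip_trailing_whitespace(text))
    return fixed, fixed != text


fixed, modified = run_hooks("x = 1   \ny = 2\n\n\n")
print(repr(fixed), modified)
```

When a hook modifies a file, the commit is aborted so that you can review and stage the fix, then commit again.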

7. .github/ directory (and workflows/ci-cd.yml)

  • Purpose: This directory is specifically for GitHub-related files.
    • workflows/: Contains GitHub Actions workflow definition files (YAML).
  • ci-cd.yml:
    • Purpose: Defines your Continuous Integration/Continuous Deployment (CI/CD) pipeline.
    • Content:
      • Triggers: Runs on pushes and pull requests to the main branch.
      • Jobs:
        • lint-test-analyze: Checks out code, sets up Python (matrix for multiple versions), installs dependencies (including dev), runs Ruff (lint & format check), Mypy (type checking), Bandit (code security), Safety (dependency security), and Pytest (unit tests with coverage).
        • deploy-and-run-on-ec2: Depends on lint-test-analyze succeeding and only runs on push to main. It checks out code, uses appleboy/ssh-action to connect to your EC2 instance (using GitHub Secrets for credentials), clones/updates the repo on EC2, sets up a virtual environment, installs runtime dependencies from requirements.txt, and runs src/main.py.
    • Why: Automates testing, linting, and deployment, ensuring code quality and providing a consistent way to deploy changes.
    • Importance: A cornerstone of modern software development for ensuring reliability and automating repetitive tasks.
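
The gating between the two jobs can be summarised as a small decision function. This is a simplified Python model of the behaviour described above (deploy only after every check passes, and only for a push to main), not a transcription of the workflow file; the Event class and should_deploy function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str    # e.g. "push" or "pull_request"
    branch: str  # branch the event targets


def should_deploy(event: Event, check_results: dict[str, bool]) -> bool:
    """Model of the gate in front of deploy-and-run-on-ec2."""
    all_checks_passed = all(check_results.values())
    is_push_to_main = event.name == "push" and event.branch == "main"
    return all_checks_passed and is_push_to_main


checks = {"ruff": True, "mypy": True, "bandit": True, "safety": True, "pytest": True}
print(should_deploy(Event("push", "main"), checks))
print(should_deploy(Event("pull_request", "main"), checks))
print(should_deploy(Event("push", "main"), {**checks, "pytest": False}))
```

Pull requests therefore get the full set of checks but never trigger a deployment, and a single failing check blocks the deployment even on main.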

8. data/ directory

  • Purpose: To store data files related to the project.
  • Content:
    • sample_input.csv: An example input file. In this project, it's committed, which can be useful for new users to quickly run the script or for baseline testing.
    • processed_output.csv: The output generated by the script. It's ignored by Git (as per .gitignore) because it's a generated artifact, not source.
    • data_processing.log: The log file generated by src/main.py. Also ignored by Git.
  • Why this separation? Keeps data files separate from source code (src/) and tests (tests/), improving organization.
  • Alternatives:
    • Placing data files directly in the root (less organized).
    • Placing them within src/ (generally not recommended for input/output data, as src/ is for installable package code).
  • Chosen because: Clear separation of concerns.
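
One practical consequence of this layout is that code can locate the data files relative to the project root instead of relying on the current working directory. The sketch below shows one way to do that with pathlib; whether src/main.py resolves its paths exactly like this is an assumption, and data_paths is a hypothetical helper.

```python
from pathlib import Path


def data_paths(project_root: Path) -> dict[str, Path]:
    """Build the input, output, and log paths under <project_root>/data."""
    data_dir = project_root / "data"
    return {
        "input": data_dir / "sample_input.csv",
        "output": data_dir / "processed_output.csv",
        "log": data_dir / "data_processing.log",
    }


# From a module located at <root>/src/main.py, the project root is two
# levels up: parents[0] is src/, parents[1] is the root.
example_module = Path("/tmp/my_python_project/src/main.py")
paths = data_paths(example_module.parents[1])
print(paths["input"])
```

Inside the real module, Path(__file__).resolve() would take the place of the hard-coded example path.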

9. src/ directory

  • Purpose: Contains the source code of your Python application/library. This is the code that forms your installable package.
  • Content:
    • main.py: The main script for your data processing logic.
    • __pycache__/: Directory for Python's bytecode cache files. Automatically generated and ignored by Git.
    • my_data_project_src_main.egg-info/: Directory generated when you install your project in editable mode (pip install -e .) or build it. It contains metadata about the package (name, version, dependencies, entry points, etc.). This is a build artifact and is correctly ignored by Git.
  • Why the src layout?
    • Avoids accidental imports: With the code under src/, the package cannot be imported by accident just because the current working directory is the project root. You have to install the package (even in editable mode) to import it, which better simulates how it will behave when actually installed.
    • Clear separation: Clearly distinguishes package source code from tests, scripts, documentation, etc.
  • Alternatives: Flat layout (placing package code directly in the project root alongside setup.py or pyproject.toml). The src layout is generally preferred for packages.
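
The difference between the two layouts can be observed directly with the import machinery. The sketch below builds a throwaway flat project and a throwaway src project in a temporary directory and asks whether a package is findable when only the project root is searched. The package name demo_pkg and the helper functions are hypothetical.

```python
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def importable_from(directory: Path, package: str) -> bool:
    """Can the import system locate `package` when searching only `directory`?"""
    return PathFinder.find_spec(package, [str(directory)]) is not None


def make_package(parent: Path, name: str) -> None:
    """Create an empty package (directory with __init__.py) under `parent`."""
    package_dir = parent / name
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")


with tempfile.TemporaryDirectory() as tmp:
    flat_root = Path(tmp) / "flat_project"
    src_root = Path(tmp) / "src_project"
    make_package(flat_root, "demo_pkg")         # flat_project/demo_pkg/
    make_package(src_root / "src", "demo_pkg")  # src_project/src/demo_pkg/

    flat_visible = importable_from(flat_root, "demo_pkg")
    src_visible = importable_from(src_root, "demo_pkg")
    src_dir_visible = importable_from(src_root / "src", "demo_pkg")

print(flat_visible, src_visible, src_dir_visible)
```

In the flat project the package is found from the project root, so running Python there silently picks up the working copy. In the src project it is found only once src/ itself is on the search path, which is what an editable install arranges.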

10. tests/ directory

  • Purpose: Contains all the test code for your project.
  • Content:
    • test_main.py: Unit tests for src/main.py.
    • __pycache__/: Bytecode cache for test files, ignored by Git.
  • Why: Standard practice to keep tests separate from source code but in the same repository. This allows test runners like pytest to easily discover and execute them.
  • Naming Convention: Test files are typically prefixed with test_ (e.g., test_main.py), and test functions within them are also prefixed with test_ (e.g., test_process_data_with_input_file). This is a convention recognized by pytest.
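
As an illustration of the convention, here is what a small test module could look like. Everything in it is hypothetical: normalize_header stands in for real code that would normally be imported from src/, and the file name tests/test_example.py is only an example.

```python
# Contents of a hypothetical tests/test_example.py


def normalize_header(name: str) -> str:
    """Hypothetical helper standing in for code imported from src/."""
    return name.strip().lower().replace(" ", "_")


# pytest collects these functions because their names start with "test_".
def test_normalize_header_replaces_spaces():
    assert normalize_header(" Order ID ") == "order_id"


def test_normalize_header_keeps_clean_names():
    assert normalize_header("price") == "price"
```

Plain assert statements are enough; pytest rewrites them to report the compared values when a test fails.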

11. venv/ directory

  • Purpose: This directory contains the Python virtual environment for your project. A virtual environment is an isolated Python installation that allows you to manage project-specific dependencies without interfering with system-wide Python packages or other projects.
  • Content: Includes a copy/symlink of the Python interpreter, the site-packages directory (where project dependencies are installed), activation scripts (bin/activate on Linux and macOS, Scripts\activate on Windows), etc. The detailed file tree you provided under venv/ shows these components.
  • Why:
    • Dependency Isolation: Ensures your project uses the specific versions of libraries it needs, avoiding conflicts with other projects or the system Python.
    • Reproducibility: Makes it easier to reproduce the development environment on other machines.
  • Importance: Essential for virtually all Python development. It's correctly listed in your .gitignore as it's specific to your local setup and can be large and platform-dependent. Others will create their own venv based on pyproject.toml or requirements.txt.
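
The standard library exposes both sides of this: the venv module creates an environment, and comparing sys.prefix with sys.base_prefix tells you whether the running interpreter lives in one. The sketch below creates a throwaway environment in a temporary directory; it skips pip only to keep the demonstration fast, which a real project environment would not do.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "venv"
    # with_pip=False keeps the demo fast; a real project venv would want pip.
    venv.create(env_dir, with_pip=False)
    has_config = (env_dir / "pyvenv.cfg").is_file()
    has_site_packages = any(env_dir.rglob("site-packages"))

print(in_virtual_environment(), has_config, has_site_packages)
```

The pyvenv.cfg file records which base interpreter the environment was created from, which is one reason a venv directory is tied to the machine it was built on and does not belong in version control.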

12. documentation/ directory

  • Purpose: This directory was created by you to store project documentation.
  • Content: Contains 100.md (likely a file where you are drafting documentation).
  • Git Status: Your .gitignore includes documentation/, meaning it's currently not tracked by Git.
    • Why ignore? If documentation is auto-generated (e.g., using Sphinx from docstrings) and hosted elsewhere (like Read the Docs), then the generated HTML/PDF in documentation/ might be ignored.
    • Why track? If it contains manually written source files for documentation (like Markdown files that are then built or just read directly), it's common to version control these.
  • Decision: You'll need to decide if you want to version your documentation source files. If 100.md is a source file, you might want to remove documentation/ from .gitignore (or use more specific ignores like documentation/_build/ if using a tool like Sphinx).

13. directory_structure.txt

  • Purpose: A file you likely created to list or visualize the project's directory tree. Useful for understanding or explaining the layout.
  • Git Status: Not typically versioned unless it's meant to be a persistent part of the project's documentation about its own structure.

14. .coverage

  • Purpose: This file is generated by the coverage.py tool (often used via pytest-cov). It stores data about which lines of your code were executed during your test runs.
  • Why: Used to generate coverage reports (e.g., HTML reports in htmlcov/ or terminal output) that show test coverage metrics.
  • Git Status: Correctly ignored by .gitignore as it's a build/test artifact that changes with each test run.
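
To illustrate what "which lines were executed" means, here is a toy line tracer built on sys.settrace. It is only a sketch of the idea; coverage.py's real implementation is considerably more elaborate, and trace_lines and classify are hypothetical names.

```python
import sys


def trace_lines(func, *args):
    """Call func(*args) and return (result, executed line offsets).

    Offsets are relative to the function's `def` line."""
    executed = set()
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore any tracer that was active before
    return result, executed


def classify(n):
    if n < 0:
        return "negative"
    return "non-negative"


_, lines_for_positive = trace_lines(classify, 5)
_, lines_for_negative = trace_lines(classify, -1)
print(sorted(lines_for_positive), sorted(lines_for_negative))
```

A test suite that only ever calls classify with positive numbers never runs the "negative" branch, and a coverage report would flag that line as missed.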

15. .pytest_cache/

  • Purpose: This directory is created by pytest to cache information about test runs. This can speed up subsequent test runs, especially when using options like --lf (last-failed) or --ff (failed-first).
  • Content: Contains files like CACHEDIR.TAG, README.md (from pytest itself), and subdirectories (v/cache/) storing node IDs, last failed tests, etc.
  • Git Status: Correctly ignored by .gitignore as it's a local cache specific to your test environment.

This detailed breakdown should give you a good understanding of why your project is structured the way it is, and the role each key file and directory plays. This structure is robust and follows modern Python development conventions.