Understanding .github/workflows/ci-cd.yml - inzamamshajahan/github-actions-learn4 GitHub Wiki

This file is the heart of the project's automation, defining how GitHub Actions will build, test, and deploy your code.

Overall Purpose:

This YAML file defines a GitHub Actions workflow. GitHub Actions is a platform that allows you to automate your software development workflows directly within your GitHub repository. This specific workflow, named "Python CI/CD - Data Script (src/main.py) with Logging," is designed to implement a Continuous Integration (CI) and Continuous Deployment (CD) pipeline for your Python data transformation script.

  • Continuous Integration (CI): The first part of the workflow focuses on CI. Every time code is pushed to the main branch or a pull request is made to main, this part automatically runs a series of checks:

    • Linting: Ensures code style consistency and catches potential syntax errors (using Ruff).
    • Formatting: Verifies that the code adheres to defined formatting rules (using Ruff format).
    • Static Type Checking: Catches type-related errors before runtime (using Mypy).
    • Security Scanning: Checks for common security vulnerabilities in the code (using Bandit) and in the dependencies (using Safety).
    • Testing: Executes unit tests to verify the correctness of the code's logic (using Pytest), including code coverage analysis. This CI process is run across multiple Python versions to ensure broader compatibility.
  • Continuous Deployment (CD): The second part of the workflow handles CD. If all CI checks pass on a push to the main branch, this part automatically deploys and runs your src/main.py script on a pre-configured AWS EC2 instance.

Why this workflow?

  • Automation: Reduces manual effort in testing and deployment.
  • Quality Assurance: Catches errors, style issues, and vulnerabilities early in the development cycle.
  • Consistency: Ensures that all code merged into main meets defined quality standards and that deployments are performed in a standardized way.
  • Confidence: Provides confidence that changes are safe to deploy after passing all automated checks.
  • Collaboration: Makes it easier for multiple developers to contribute by having an automated gatekeeper for code quality.

Top-Level Workflow Configuration:

name: Python CI/CD - Data Script (src/main.py) with Logging
  • name:
    • What: A human-readable name for the workflow. This name is displayed on the "Actions" tab of your GitHub repository.
    • Why: Provides a clear identifier for this specific workflow, especially if you have multiple workflows in a repository.
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
  • on:
    • What: Defines the events that trigger the workflow to run.
    • push: The workflow runs whenever someone pushes commits to the repository.
      • branches: [ "main" ]: This restricts the push trigger to only run when pushes are made to the main branch.
    • pull_request: The workflow runs whenever a pull request is opened, synchronized (new commits pushed to the PR branch), or reopened.
      • branches: [ "main" ]: This means the workflow runs for pull requests that target the main branch.
    • Why these triggers?
      • Pull Requests to main: This is crucial for CI. Before merging any code into main, all checks are run on the proposed changes to ensure they don't break anything or introduce issues.
      • Pushes to main: This is essential for CD. Once changes are merged into main (and have presumably passed PR checks), this trigger initiates the deployment process. It also covers direct pushes to main (though often protected by branch rules).
    • Alternatives:
      • Specific tags: on: push: tags: [ 'v*' ] (for release-triggered deployments).
      • Scheduled runs: on: schedule: - cron: '0 0 * * *' (for nightly builds, for example).
      • Manual dispatch: on: workflow_dispatch (to trigger the workflow manually from the GitHub UI).
      • Specific file paths: Triggering only if certain files/directories change (on: push: paths: [ 'src/**' ]).
    • Chosen because: The combination of push and pull_request on the main branch is a very common and effective setup for CI/CD, ensuring code is validated before merge and deployed after merge.
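For illustration, the alternative triggers listed above could be combined into an `on:` block like the following. This is a hypothetical sketch, not part of this project's workflow; the tag pattern, cron schedule, and path filter are placeholder values.

```yaml
# Illustrative only — combines the alternative triggers described above.
on:
  push:
    tags: [ "v*" ]           # release-triggered deployments
    paths: [ "src/**" ]      # only when source files change
  schedule:
    - cron: "0 0 * * *"      # nightly at midnight UTC
  workflow_dispatch:         # manual trigger from the Actions tab
```

Note that `tags` and `paths` filters on the same `push` event are combined, so a push must match both to trigger a run.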

Jobs:

jobs:
  # ... job definitions ...
  • jobs:
    • What: A workflow run is made up of one or more jobs, which run in parallel by default. Each job runs in its own runner environment.
    • This workflow has two jobs: lint-test-analyze and deploy-and-run-on-ec2.

Job 1: lint-test-analyze

  lint-test-analyze:
    name: Lint, Test & Analyze
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
  • lint-test-analyze:
    • What: The identifier for this job.
  • name: Lint, Test & Analyze:
    • What: A human-readable name for the job displayed in the GitHub UI.
  • runs-on: ubuntu-latest:
    • What: Specifies the type of machine (runner) to run the job on. GitHub provides hosted runners for Linux, Windows, and macOS.
    • Why ubuntu-latest? Linux environments are common for CI, widely supported, and often cost-effective. latest ensures you're using a recent version with up-to-date tools.
    • Alternatives: windows-latest, macos-latest, or self-hosted runners (if you need specific hardware, software, or network configurations not available on GitHub-hosted runners).
  • strategy: matrix::
    • What: Defines a build matrix. This allows you to run the same job multiple times with different configurations.
    • python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]: This creates a matrix where the lint-test-analyze job will run once for each Python version specified in this list. The variable matrix.python-version will be available within the job steps.
    • Why use a matrix? To ensure your code is compatible with and behaves correctly across multiple Python versions that you intend to support.
    • Alternatives: Testing only a single Python version (simpler but less robust if you aim for broad compatibility).
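One optional refinement worth knowing about (not used in this workflow): by default, GitHub Actions cancels all in-progress matrix jobs as soon as one fails. Setting `fail-fast: false` lets every Python version run to completion so you see all failures at once. A hypothetical variant:

```yaml
# Hypothetical variant — not this project's actual configuration.
strategy:
  fail-fast: false   # let all matrix jobs finish even if one fails
  matrix:
    python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
```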

Steps within lint-test-analyze:

Each step is an individual task executed sequentially within the job.

  1. name: Checkout code

    - name: Checkout code
      uses: actions/checkout@v4
    
    • uses: actions/checkout@v4: This uses a pre-built, official GitHub Action.
    • Purpose: To check out your repository's code onto the runner so the workflow can access it. Version @v4 is used for stability.
    • Alternative: Manually scripting git clone commands, which is more verbose and error-prone.
  2. name: Set up Python ${{ matrix.python-version }}

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v4
      with:
        python-version: ${{ matrix.python-version }}
        cache: 'pip'
    
    • uses: actions/setup-python@v4: Another official GitHub Action.
    • Purpose: To set up a specific version of Python on the runner.
    • with: python-version: ${{ matrix.python-version }}: This dynamically inserts the Python version from the build matrix (e.g., "3.8", then "3.9", etc., in separate job runs).
    • cache: 'pip':
      • Purpose: Enables caching of pip dependencies. On the first run for a Python version, dependencies are downloaded and installed. This step will cache them. Subsequent runs for the same Python version and same dependencies can restore these dependencies from the cache, speeding up the workflow significantly.
      • Why: Reduces build times and network usage.
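For comparison, before `setup-python` gained built-in caching, the same effect required the generic `actions/cache` action. A sketch of that more verbose alternative (the cache key shown here is illustrative, not taken from this project):

```yaml
# More verbose alternative to `cache: 'pip'` — illustrative sketch only.
- name: Cache pip downloads
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip    # pip's default download cache on Linux runners
    key: pip-${{ runner.os }}-${{ matrix.python-version }}-${{ hashFiles('pyproject.toml') }}
```

The `cache: 'pip'` shorthand hides this boilerplate, which is why it is the preferred form here.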
  3. name: Install dependencies

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -e .[dev]
    
    • run: |: Executes shell commands. The | indicates a multi-line script.
    • python -m pip install --upgrade pip: Upgrades pip to its latest version within the runner's Python environment. This is a good practice to ensure you're using the latest pip features and bug fixes.
    • pip install -e .[dev]:
      • Installs the project itself in "editable" mode (-e .). This means changes to the source code are immediately reflected without needing a reinstall, and the package is available on the PYTHONPATH.
      • [dev]: This installs the optional dependencies listed under the dev group in your pyproject.toml file (pytest, ruff, mypy, bandit, safety, etc.). This is crucial because these tools are needed for the subsequent linting and testing steps.
    • Why this approach? Ensures all necessary tools and the project itself are available in the environment for the CI checks.
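For context, the `[dev]` extra corresponds to an optional-dependency group in `pyproject.toml` along these lines. The exact package list and any version pins are assumptions inferred from the tools this workflow runs, not copied from the project file:

```toml
# Hypothetical sketch of the relevant pyproject.toml section;
# the project's actual packages and pins may differ.
[project.optional-dependencies]
dev = [
    "pytest",
    "pytest-cov",
    "ruff",
    "mypy",
    "bandit",
    "safety",
]
```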
  4. name: Lint and Format Check with Ruff

    - name: Lint and Format Check with Ruff
      run: |
        ruff check .
        ruff format --check .
    
    • ruff check .: Runs Ruff to check for linting issues (style violations, potential bugs based on selected rules in pyproject.toml) across all files in the current directory (.).
    • ruff format --check .: Runs Ruff's formatter in "check" mode. It doesn't modify files but will exit with an error if files are not formatted according to the rules, ensuring code formatting consistency.
    • Why these commands? Automates code style and formatting enforcement. The --check for format is important in CI to fail if formatting is incorrect, rather than silently reformatting.
  5. name: Static type checking with Mypy

    - name: Static type checking with Mypy
      run: mypy src tests --config-file pyproject.toml
    • mypy src tests --config-file pyproject.toml: Runs Mypy to perform static type checking on the src and tests directories.
    • --config-file pyproject.toml: Tells Mypy to use its configuration from the [tool.mypy] section of your pyproject.toml.
    • Why Mypy? Helps catch type errors before runtime, improving code reliability and maintainability.
  6. name: Security scan (code) with Bandit

    - name: Security scan (code) with Bandit
      run: bandit -r src -c pyproject.toml
    
    • bandit -r src -c pyproject.toml: Runs Bandit to scan the src directory recursively (-r) for common security vulnerabilities in Python code.
    • -c pyproject.toml: Instructs Bandit to look for its configuration within the pyproject.toml file (under [tool.bandit]).
    • Why Bandit? Helps identify potential security risks like hardcoded passwords, use of unsafe deserialization, etc.
  7. name: Security scan (dependencies) with Safety

    - name: Security scan (dependencies) with Safety
      run: |
        pip freeze > current_requirements.txt
        safety check -r current_requirements.txt
    
    • pip freeze > current_requirements.txt: Generates a list of all installed packages (including their exact versions and transitive dependencies) in the current environment and saves it to current_requirements.txt.
    • safety check -r current_requirements.txt: Runs Safety to check the packages listed in current_requirements.txt against a database of known security vulnerabilities.
    • Why Safety? Helps identify if any of your project's dependencies (direct or indirect) have known security issues that could affect your application.
    • Note on safety check deprecation: The logs indicate that safety check is deprecated in favor of safety scan. When you move to a Safety release that drops safety check, switch to the safety scan command, and consult Safety's current documentation for the exact invocation, since its options differ from those of safety check.
  8. name: Run tests with Pytest

    - name: Run tests with Pytest
      run: pytest
    
    • pytest: Executes Pytest. Pytest will automatically discover test files (e.g., test_*.py or *_test.py) and test functions (prefixed with test_) in the tests directory (as configured in pyproject.toml). It will also use other configurations from pyproject.toml, such as code coverage options (--cov=main, --cov-report=term-missing, etc.).
    • Why Pytest? A robust testing framework that integrates well with other tools and has a clean syntax.
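As a minimal illustration of the discovery conventions Pytest relies on, a test file looks like the following. This file is hypothetical, not part of the project; the point is that the filename starts with `test_` and so does the function name.

```python
# tests/test_example.py — hypothetical file illustrating pytest's discovery
# conventions: pytest finds it by name alone, with no registration step.

def add(a: int, b: int) -> int:
    """Toy function standing in for real project code."""
    return a + b

def test_add() -> None:
    # Pytest treats a plain failing `assert` as a test failure and
    # reports the compared values in its output.
    assert add(2, 3) == 5
```

Running `pytest` in the repository root would collect and execute `test_add` automatically.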

Job 2: deploy-and-run-on-ec2

This job is responsible for deploying and running the script on your AWS EC2 instance.

  deploy-and-run-on-ec2:
    name: Deploy and Run Script on EC2
    needs: lint-test-analyze
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
  • deploy-and-run-on-ec2: The identifier for this job.
  • name: Deploy and Run Script on EC2: Human-readable name.
  • needs: lint-test-analyze:
    • What: Specifies that this job depends on the successful completion of the lint-test-analyze job.
    • Why: Ensures that code is only deployed if all CI checks (linting, testing, security scans) have passed. This is a fundamental principle of safe CI/CD.
  • if: github.ref == 'refs/heads/main' && github.event_name == 'push':
    • What: A conditional statement that determines if the job should run. It uses GitHub Actions context variables.
      • github.ref: The branch or tag ref that triggered the workflow. refs/heads/main refers to the main branch.
      • github.event_name: The name of the event that triggered the workflow (e.g., push, pull_request).
    • Why: This condition ensures that this deployment job only runs when there is a direct push to the main branch. It will not run for pull requests, even if they target main (those are handled by the CI job only). This is a standard practice to prevent deployments from feature branches or PRs directly.
  • runs-on: ubuntu-latest: Specifies the runner for this job.

Steps within deploy-and-run-on-ec2:

  1. name: Checkout code

    - name: Checkout code
      uses: actions/checkout@v4
    
    • Purpose: Checks out the repository code again. Even though it was checked out in the previous job, each job runs in a fresh environment, so the code needs to be available.
  2. name: Deploy to EC2 and Run Script

    - name: Deploy to EC2 and Run Script
      uses: appleboy/ssh-action@<pinned-version>
      with:
        host: ${{ secrets.EC2_HOST }}
        username: ${{ secrets.EC2_USERNAME }}
        key: ${{ secrets.EC2_SSH_PRIVATE_KEY }}
        port: ${{ secrets.EC2_PORT }} # Default is 22 if not set
        script: |
          # ... (multi-line shell script) ...
    
    • uses: appleboy/ssh-action@<pinned-version>: This uses a popular third-party GitHub Action designed to simplify SSH operations. Pin it to a specific release tag (or, stricter, a commit SHA) so that the workflow stays reproducible.
    • with:: Provides inputs to the ssh-action.
      • host: ${{ secrets.EC2_HOST }}: The hostname or IP address of your EC2 instance.
      • username: ${{ secrets.EC2_USERNAME }}: The username for SSH login.
      • key: ${{ secrets.EC2_SSH_PRIVATE_KEY }}: The private SSH key for authentication.
      • port: ${{ secrets.EC2_PORT }}: The SSH port on the EC2 instance. If the EC2_PORT secret is not set, this expression evaluates to an empty string, and the action may fail or fall back to its own internal default. When port 22 is the common case, the more robust form is port: ${{ secrets.EC2_PORT || '22' }}, which explicitly defaults to 22 whenever the secret is absent.
      • Importance of secrets: These values are sensitive and are stored as encrypted secrets in your GitHub repository settings (Settings > Secrets and variables > Actions). This prevents hardcoding credentials in your workflow file.
    • script: |: This multi-line block contains the shell commands that will be executed on the EC2 instance via SSH.
      • set -e: Exits immediately if a command exits with a non-zero status. This ensures the script fails fast if any step in the deployment process goes wrong.
      • export APP_DIR="/opt/my_data_project_src_main": Defines an environment variable for the application directory on the EC2 instance. Using /opt/ is a common convention for add-on software packages.
      • sudo mkdir -p $APP_DIR: Creates the application directory if it doesn't exist. -p ensures no error if the directory already exists and creates parent directories if needed. sudo is used because /opt is typically writable only by root.
      • sudo chown ${{ secrets.EC2_USERNAME }}:${{ secrets.EC2_USERNAME }} $APP_DIR: Changes the ownership of the application directory to the SSH user. This is crucial so that subsequent commands (like git clone and pip install within a venv) can be run as the SSH user without needing sudo for every file operation within APP_DIR.
      • cd $APP_DIR: Navigates into the application directory.
      • Git Clone/Update Logic:
        if [ ! -d ".git" ]; then
          git clone https://github.com/${{ github.repository }}.git .
        else
          git remote set-url origin https://github.com/${{ github.repository }}.git
          git fetch origin main --prune
          git reset --hard origin/main
          git clean -fdx
        fi
        
        • This block handles both the first-time deployment (cloning the repository) and subsequent updates.
        • if [ ! -d ".git" ]: Checks if it's a new deployment by looking for the .git directory.
        • git clone ... .: Clones into the current directory (.).
        • git remote set-url ...: Ensures the remote URL is correct (good for robustness).
        • git fetch origin main --prune: Fetches the latest changes from the main branch of the origin remote and removes any remote-tracking branches that no longer exist on the remote (--prune).
        • git reset --hard origin/main: Forces the local main branch to exactly match the state of origin/main, discarding any local changes or commits on the server. This ensures a clean and predictable state.
        • git clean -fdx: Removes all untracked files (-f for force, -d for directories, -x for ignored files too). This ensures that any artifacts from previous builds or runs are cleaned up, providing a pristine environment.
        • Idempotency: This set of commands is designed to be idempotent, meaning running it multiple times will achieve the same result as running it once.
      • Python Virtual Environment Setup on EC2:
        if [ ! -d "venv" ]; then
          python3 -m venv venv
        fi
        source venv/bin/activate
        
        • Creates a virtual environment named venv if it doesn't already exist.
        • Activates the virtual environment. All subsequent pip and python commands will use this isolated environment.
      • Installing Runtime Dependencies on EC2:
        pip install --upgrade pip
        pip install -r requirements.txt
        
        • Upgrades pip within the virtual environment.
        • Installs only the runtime dependencies (pandas, numpy) specified in requirements.txt. This is efficient as it doesn't install the large set of development tools on the "production-like" EC2 instance.
      • Running the Data Processing Script:
        python src/main.py
        
        • Executes your main Python script. The PROJECT_ROOT logic within src/main.py will correctly resolve paths relative to $APP_DIR/src/main.py on the EC2 instance.
      • Informational echo statements: Provide feedback in the GitHub Actions log about the deployment process and where to find output files on the EC2 instance.
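The "create if missing" pattern that makes these steps idempotent can be demonstrated in isolation. This toy sketch substitutes a temp directory for /opt (so no sudo is needed) and a marker file for the real git checkout and virtual environment:

```shell
#!/usr/bin/env bash
# Toy demonstration of the idempotent "create if missing" pattern used in the
# deployment script above. Running the same logic twice produces the same
# final state; only the first run performs the one-time setup.
set -e                               # fail fast, as in the deployment script
APP_DIR="$(mktemp -d)/my_app"
for run in 1 2; do                   # execute the same logic twice
  mkdir -p "$APP_DIR"                # -p: no error if the directory exists
  if [ ! -f "$APP_DIR/.initialized" ]; then
    touch "$APP_DIR/.initialized"    # first run: perform one-time setup
    echo "run $run: initialized"
  else
    echo "run $run: already initialized"   # second run: no-op
  fi
done
```

The same reasoning applies to the real script: `mkdir -p`, the `.git` check, and the `venv` check each make a step safe to repeat.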

Security and Best Practices in the Workflow:

  • GitHub Secrets: Correctly used for sensitive information like EC2 host, username, and SSH key.
  • Principle of Least Privilege (Partial): While sudo is used for initial directory creation and ownership change, subsequent operations within APP_DIR are performed as the non-root SSH user. The SSH key used should ideally have restricted permissions on the EC2 instance if possible.
  • Idempotent Deployment Script: The script on EC2 attempts to be idempotent, which is good for re-runnability.
  • Dependency Pinning (Implicit):
    • For CI, pip install -e .[dev] installs dependencies based on pyproject.toml. If versions there are pinned or have tight ranges, it's good. If they are loose, the CI environment might vary slightly over time.
    • For CD, pip install -r requirements.txt is used. The reproducibility here depends on how requirements.txt is managed. If it's generated with pip freeze from a known good environment, it pins versions. If manually curated with loose versions, the EC2 environment might vary.
  • Fail Fast: set -e in the deployment script and the fact that GitHub Actions steps fail on non-zero exit codes help in identifying issues quickly.

Potential Improvements or Alternatives:

  • EC2 Setup on Runner vs. SSH Action: For very complex EC2 interactions, one might consider using AWS-specific GitHub Actions (e.g., to use AWS CLI, SSM Run Command) rather than a generic SSH action, though appleboy/ssh-action is fine for this scale.
  • Artifact-Based Deployment: Instead of git clone on the EC2 instance, the CI job could build a distributable artifact (like a Python wheel or a zip file containing the script and dependencies). This artifact would then be transferred to EC2 and deployed. This can be more robust and faster for deployment.
  • Configuration Management for EC2: For managing the EC2 instance itself (ensuring Python, Git, etc., are installed), tools like Ansible, Chef, or Puppet, or using pre-baked AMIs (Amazon Machine Images) would be more robust solutions for production environments.
  • Containerization (Docker): A very common approach is to containerize the Python application using Docker. The CI pipeline would build a Docker image, push it to a registry (like Docker Hub or AWS ECR), and the CD part would pull and run this image on EC2 (or a container orchestration service like ECS or EKS). This provides greater environment consistency.
  • More Specific runs-on: Instead of ubuntu-latest, specifying a particular version like ubuntu-22.04 can provide more stability over time, as latest can change and potentially introduce unexpected behavior.
  • Safety Command Update: As noted in the logs, update safety check to safety scan or the current recommended command by the Safety tool.
  • Workflow Reusability: For more complex projects with multiple similar workflows, GitHub Actions offers "Reusable Workflows" or "Composite Actions" to avoid duplication.
  • Environment Variables for APP_DIR: Instead of export APP_DIR in the script, you could potentially set it as an environment variable for the ssh-action if the action supports it, or pass it as an argument if the script was designed to take it.
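To make the containerization option above concrete, a Dockerfile for this kind of script might look like the following. This is a hypothetical sketch only; the project does not currently ship a Dockerfile, and the base image and paths are assumptions.

```dockerfile
# Hypothetical sketch — not part of this project.
FROM python:3.12-slim

WORKDIR /app

# Install only the runtime dependencies, mirroring the EC2 deployment step.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
CMD ["python", "src/main.py"]
```

The CI job would then build and push this image to a registry, and the CD step would pull and run it on the target host instead of cloning the repository.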

This detailed documentation should clarify the purpose, design, and operation of your ci-cd.yml workflow file. It's a solid foundation for automating your project's lifecycle.