Understanding .github workflows ci‐cd.yml - inzamamshajahan/github-actions-learn4 GitHub Wiki
This file is the heart of the project's automation, defining how GitHub Actions will build, test, and deploy your code.
Overall Purpose:
This YAML file defines a GitHub Actions workflow. GitHub Actions is a platform that allows you to automate your software development workflows directly within your GitHub repository. This specific workflow, named "Python CI/CD - Data Script (src/main.py) with Logging," is designed to implement a Continuous Integration (CI) and Continuous Deployment (CD) pipeline for your Python data transformation script.
- Continuous Integration (CI): The first part of the workflow focuses on CI. Every time code is pushed to the `main` branch or a pull request is made to `main`, this part automatically runs a series of checks:
  - Linting: Ensures code style consistency and catches potential syntax errors (using Ruff).
  - Formatting: Verifies that the code adheres to defined formatting rules (using Ruff format).
  - Static Type Checking: Catches type-related errors before runtime (using Mypy).
  - Security Scanning: Checks for common security vulnerabilities in the code (using Bandit) and in the dependencies (using Safety).
  - Testing: Executes unit tests to verify the correctness of the code's logic (using Pytest), including code coverage analysis.

  This CI process is run across multiple Python versions to ensure broader compatibility.
- Continuous Deployment (CD): The second part of the workflow handles CD. If all CI checks pass on a push to the `main` branch, this part automatically deploys and runs your `src/main.py` script on a pre-configured AWS EC2 instance.
Why this workflow?
- Automation: Reduces manual effort in testing and deployment.
- Quality Assurance: Catches errors, style issues, and vulnerabilities early in the development cycle.
- Consistency: Ensures that all code merged into `main` meets defined quality standards and that deployments are performed in a standardized way.
- Confidence: Provides confidence that changes are safe to deploy after passing all automated checks.
- Collaboration: Makes it easier for multiple developers to contribute by having an automated gatekeeper for code quality.
Top-Level Workflow Configuration:

    name: Python CI/CD - Data Script (src/main.py) with Logging

`name`:
- What: A human-readable name for the workflow. This name is displayed on the "Actions" tab of your GitHub repository.
- Why: Provides a clear identifier for this specific workflow, especially if you have multiple workflows in a repository.

    on:
      push:
        branches: [ "main" ]
      pull_request:
        branches: [ "main" ]

`on`:
- What: Defines the events that trigger the workflow to run.
  - `push`: The workflow runs whenever someone pushes commits to the repository.
    - `branches: [ "main" ]`: This restricts the `push` trigger to pushes made to the `main` branch.
  - `pull_request`: The workflow runs whenever a pull request is opened, synchronized (new commits pushed to the PR branch), or reopened.
    - `branches: [ "main" ]`: This means the workflow runs for pull requests that target the `main` branch.
- Why these triggers?
  - Pull Requests to `main`: This is crucial for CI. Before merging any code into `main`, all checks are run on the proposed changes to ensure they don't break anything or introduce issues.
  - Pushes to `main`: This is essential for CD. Once changes are merged into `main` (and have presumably passed PR checks), this trigger initiates the deployment process. It also covers direct pushes to `main` (though these are often blocked by branch protection rules).
- Alternatives:
  - Specific tags: `on: push: tags: [ 'v*' ]` (for release-triggered deployments).
  - Scheduled runs: `on: schedule: - cron: '0 0 * * *'` (for nightly builds, for example).
  - Manual dispatch: `on: workflow_dispatch` (to trigger the workflow manually from the GitHub UI).
  - Specific file paths: Triggering only if certain files/directories change (`on: push: paths: [ 'src/**' ]`).
- Chosen because: The combination of `push` and `pull_request` on the `main` branch is a very common and effective setup for CI/CD, ensuring code is validated before merge and deployed after merge.
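
The alternative triggers listed above can sit side by side in one `on:` block. The following is a minimal sketch rather than the project's actual configuration; the path filter and cron schedule are the illustrative values from the list, not settings taken from the repository.

```yaml
# Illustrative trigger block (assumed values, not the project's workflow).
on:
  push:
    branches: [ "main" ]
    paths: [ "src/**" ]     # run only when a pushed commit changes files under src/
  pull_request:
    branches: [ "main" ]
  schedule:
    - cron: "0 0 * * *"     # every day at 00:00 UTC, on the default branch
  workflow_dispatch:        # allows a manual run from the Actions tab
```

Because the deployment job in this workflow checks `github.event_name == 'push'`, scheduled and manual runs in this sketch would execute only the CI job. A tag filter such as `tags: [ 'v*' ]` would be placed under `push` in the same way as `branches`.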
Jobs:

    jobs:
      # ... job definitions ...

`jobs`:
- What: A workflow run is made up of one or more jobs, which run in parallel by default. Each job runs in its own runner environment.
- This workflow has two jobs: `lint-test-analyze` and `deploy-and-run-on-ec2`.
Job 1: lint-test-analyze

    lint-test-analyze:
      name: Lint, Test & Analyze
      runs-on: ubuntu-latest
      strategy:
        matrix:
          python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]

`lint-test-analyze`:
- What: The identifier for this job.

`name: Lint, Test & Analyze`:
- What: A human-readable name for the job displayed in the GitHub UI.

`runs-on: ubuntu-latest`:
- What: Specifies the type of machine (runner) to run the job on. GitHub provides hosted runners for Linux, Windows, and macOS.
- Why `ubuntu-latest`? Linux environments are common for CI, widely supported, and often cost-effective. `latest` ensures you're using a recent version with up-to-date tools.
- Alternatives: `windows-latest`, `macos-latest`, or self-hosted runners (if you need specific hardware, software, or network configurations not available on GitHub-hosted runners).

`strategy: matrix:`:
- What: Defines a build matrix. This allows you to run the same job multiple times with different configurations.
- `python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]`: This creates a matrix where the `lint-test-analyze` job runs once for each Python version in the list. The variable `matrix.python-version` is available within the job steps.
- Why use a matrix? To ensure your code is compatible with and behaves correctly across multiple Python versions that you intend to support.
- Alternatives: Testing only a single Python version (simpler but less robust if you aim for broad compatibility).
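
To make the matrix behaviour concrete, here is a small sketch of the same job with one optional setting added. The `fail-fast: false` line and the `echo` step are illustrative additions and are not part of the workflow described here.

```yaml
# Sketch of the matrix job; fail-fast and the echo step are assumptions for illustration.
lint-test-analyze:
  runs-on: ubuntu-latest
  strategy:
    fail-fast: false   # by default, one failing version cancels the remaining matrix runs
    matrix:
      python-version: ["3.8", "3.9", "3.10", "3.11", "3.12"]
  steps:
    - name: Show the interpreter selected for this matrix entry
      run: echo "Running on Python ${{ matrix.python-version }}"
```

With five entries in the list, GitHub Actions starts five independent runs of the job, each with its own value of `matrix.python-version`. Disabling `fail-fast` is a trade-off: it uses more runner time but reports every failing version in a single run.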
Steps within `lint-test-analyze`:
Each step is an individual task executed sequentially within the job.
- `name: Checkout code`

      - name: Checkout code
        uses: actions/checkout@v4

  - `uses: actions/checkout@v4`: This uses a pre-built, official GitHub Action.
    - Purpose: To check out your repository's code onto the runner so the workflow can access it. Version `@v4` is used for stability.
    - Alternative: Manually scripting `git clone` commands, which is more verbose and error-prone.
- `name: Set up Python ${{ matrix.python-version }}`

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

  - `uses: actions/setup-python@v4`: Another official GitHub Action.
    - Purpose: To set up a specific version of Python on the runner.
  - `with: python-version: ${{ matrix.python-version }}`: This dynamically inserts the Python version from the build matrix (e.g., "3.8", then "3.9", etc., in separate job runs).
  - `cache: 'pip'`:
    - Purpose: Enables caching of pip dependencies. On the first run for a Python version, dependencies are downloaded and installed, and the step caches pip's downloaded packages. Subsequent runs for the same Python version and the same dependencies restore that cache, so `pip install` still runs but skips most downloads.
    - Why: Reduces build times and network usage.
- `name: Install dependencies`

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -e .[dev]

  - `run: |`: Executes shell commands. The `|` starts a YAML block whose lines are run as a multi-line script.
  - `python -m pip install --upgrade pip`: Upgrades pip to its latest version within the runner's Python environment. This is a good practice to ensure you're using the latest pip features and bug fixes.
  - `pip install -e .[dev]`:
    - Installs the project itself in "editable" mode (`-e .`). This means changes to the source code are immediately reflected without needing a reinstall, and the package is importable from the runner's Python environment.
    - `[dev]`: This installs the optional dependencies listed under the `dev` group in your `pyproject.toml` file (pytest, ruff, mypy, bandit, safety, etc.). This is crucial because these tools are needed for the subsequent linting and testing steps.
  - Why this approach? Ensures all necessary tools and the project itself are available in the environment for the CI checks.
- `name: Lint and Format Check with Ruff`

      - name: Lint and Format Check with Ruff
        run: |
          ruff check .
          ruff format --check .

  - `ruff check .`: Runs Ruff to check for linting issues (style violations, potential bugs based on selected rules in `pyproject.toml`) across all files in the current directory (`.`).
  - `ruff format --check .`: Runs Ruff's formatter in "check" mode. It doesn't modify files but will exit with an error if files are not formatted according to the rules, ensuring code formatting consistency.
  - Why these commands? Automates code style and formatting enforcement. The `--check` flag for format is important in CI so that the job fails if formatting is incorrect, rather than silently reformatting.
- `name: Static type checking with Mypy`

      - name: Static type checking with Mypy
        run: mypy src tests --config-file pyproject.toml

  - `mypy src tests --config-file pyproject.toml`: Runs Mypy to perform static type checking on the `src` and `tests` directories.
    - `--config-file pyproject.toml`: Tells Mypy to use its configuration from the `[tool.mypy]` section of your `pyproject.toml`.
  - Why Mypy? Helps catch type errors before runtime, improving code reliability and maintainability.
- `name: Security scan (code) with Bandit`

      - name: Security scan (code) with Bandit
        run: bandit -r src -c pyproject.toml

  - `bandit -r src -c pyproject.toml`: Runs Bandit to scan the `src` directory recursively (`-r`) for common security vulnerabilities in Python code.
    - `-c pyproject.toml`: Instructs Bandit to look for its configuration within the `pyproject.toml` file (under `[tool.bandit]`).
  - Why Bandit? Helps identify potential security risks like hardcoded passwords, use of unsafe deserialization, etc.
- `name: Security scan (dependencies) with Safety`

      - name: Security scan (dependencies) with Safety
        run: |
          pip freeze > current_requirements.txt
          safety check -r current_requirements.txt

  - `pip freeze > current_requirements.txt`: Generates a list of all installed packages (including their exact versions and transitive dependencies) in the current environment and saves it to `current_requirements.txt`.
  - `safety check -r current_requirements.txt`: Runs Safety to check the packages listed in `current_requirements.txt` against a database of known security vulnerabilities.
  - Why Safety? Helps identify if any of your project's dependencies (direct or indirect) have known security issues that could affect your application.
  - Note on `safety check` deprecation: The logs indicate `safety check` is deprecated in favor of `safety scan`. Plan to migrate to `safety scan`, following Safety's current documentation for the exact invocation, since its options may not match those of `safety check` (the `-r` flag in particular may not carry over unchanged).
- `name: Run tests with Pytest`

      - name: Run tests with Pytest
        run: pytest

  - `pytest`: Executes Pytest. Pytest will automatically discover test files (e.g., `test_*.py` or `*_test.py`) and test functions (prefixed with `test_`) in the `tests` directory (as configured in `pyproject.toml`). It will also use other configurations from `pyproject.toml`, such as code coverage options (`--cov=main`, `--cov-report=term-missing`, etc.).
  - Why Pytest? A robust testing framework that integrates well with other tools and has a clean syntax.
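
Two of the steps above rely on settings that live outside the workflow file: the pip cache key and the coverage options. The sketch below spells both out in the workflow itself. It is an illustrative variation, not the project's configuration; `cache-dependency-path` pointing at `pyproject.toml` is an assumption about where the dependencies are declared, and the explicit coverage flags assume `pytest-cov` is part of the `dev` extras.

```yaml
# Illustrative variation of the CI steps (assumptions noted inline).
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-python@v4
    with:
      python-version: ${{ matrix.python-version }}
      cache: 'pip'
      cache-dependency-path: pyproject.toml   # assumed: hash this file for the cache key
  - name: Install dependencies
    run: |
      python -m pip install --upgrade pip
      pip install -e ".[dev]"                 # quoted so the brackets are never glob-expanded
  - name: Run tests with coverage
    # Same flags the text attributes to pyproject.toml, passed explicitly here.
    run: pytest --cov=main --cov-report=term-missing
```

Keeping the flags in `pyproject.toml`, as the project does, has the advantage that local runs and CI runs behave identically; the explicit form is mainly useful when a single job needs different options.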
Job 2: deploy-and-run-on-ec2
This job is responsible for deploying and running the script on your AWS EC2 instance.

    deploy-and-run-on-ec2:
      name: Deploy and Run Script on EC2
      needs: lint-test-analyze
      if: github.ref == 'refs/heads/main' && github.event_name == 'push'
      runs-on: ubuntu-latest

`deploy-and-run-on-ec2`: The identifier for this job.

`name: Deploy and Run Script on EC2`: Human-readable name.

`needs: lint-test-analyze`:
- What: Specifies that this job depends on the successful completion of the `lint-test-analyze` job. Because that job uses a matrix, every Python-version run must succeed.
- Why: Ensures that code is only deployed if all CI checks (linting, testing, security scans) have passed. This is a fundamental principle of safe CI/CD.

`if: github.ref == 'refs/heads/main' && github.event_name == 'push'`:
- What: A conditional statement that determines if the job should run. It uses GitHub Actions context variables.
  - `github.ref`: The branch or tag ref that triggered the workflow. `refs/heads/main` refers to the `main` branch.
  - `github.event_name`: The name of the event that triggered the workflow (e.g., `push`, `pull_request`).
- Why: This condition ensures that the deployment job runs only for `push` events on the `main` branch, which includes the push created when a pull request is merged. It will not run for pull request events, even if they target `main` (those are handled by the CI job only). This is a standard practice to prevent deployments from feature branches or unmerged PRs.

`runs-on: ubuntu-latest`: Specifies the runner for this job.
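
The interaction between `needs` and `if` is easiest to see in a stripped-down workflow. The sketch below keeps the two job identifiers and the condition from this file but replaces the real steps with placeholder `echo` commands, which are assumptions for illustration only.

```yaml
# Minimal gating sketch; the echo steps stand in for the real CI and deploy steps.
jobs:
  lint-test-analyze:
    runs-on: ubuntu-latest
    steps:
      - run: echo "CI checks run here"

  deploy-and-run-on-ec2:
    needs: lint-test-analyze          # waits for CI and is skipped if CI fails
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - run: echo "Deploying commit ${{ github.sha }}"
```

For a pull request targeting `main`, only the first job runs, because `github.event_name` is `pull_request`. For a push to `main`, the second job starts once the first has succeeded. If the first job fails, the second is skipped regardless of the `if` condition.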
Steps within `deploy-and-run-on-ec2`:
- `name: Checkout code`

      - name: Checkout code
        uses: actions/checkout@v4

  - Purpose: Checks out the repository code again. Even though it was checked out in the previous job, each job runs in a fresh environment, so the code needs to be available.
- `name: Deploy to EC2 and Run Script`

      - name: Deploy to EC2 and Run Script
        uses: appleboy/[email protected]
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ${{ secrets.EC2_USERNAME }}
          key: ${{ secrets.EC2_SSH_PRIVATE_KEY }}
          port: ${{ secrets.EC2_PORT }} # Default is 22 if not set
          script: |
            # ... (multi-line shell script) ...

  - `uses: appleboy/[email protected]`: This uses a popular third-party GitHub Action designed to simplify SSH operations.
  - `with:`: Provides inputs to the `ssh-action`.
    - `host: ${{ secrets.EC2_HOST }}`: The hostname or IP address of your EC2 instance.
    - `username: ${{ secrets.EC2_USERNAME }}`: The username for SSH login.
    - `key: ${{ secrets.EC2_SSH_PRIVATE_KEY }}`: The private SSH key for authentication.
    - `port: ${{ secrets.EC2_PORT }}`: The SSH port on the EC2 instance. If the `EC2_PORT` secret isn't set, the expression evaluates to an empty string, and the action might fail or use its own internal default. The original instructions used `default('22')` here, but GitHub Actions expressions have no `default()` filter. The equivalent fallback is written with the `||` operator, for example `port: ${{ secrets.EC2_PORT || 22 }}`, which is more robust if port 22 is the common case.
    - Importance of `secrets`: These values are sensitive and are stored as encrypted secrets in your GitHub repository settings (Settings > Secrets and variables > Actions). This prevents hardcoding credentials in your workflow file.
  - `script: |`: This multi-line block contains the shell commands that will be executed on the EC2 instance via SSH.
    - `set -e`: Exits immediately if a command exits with a non-zero status. This ensures the script fails fast if any step in the deployment process goes wrong.
    - `export APP_DIR="/opt/my_data_project_src_main"`: Defines an environment variable for the application directory on the EC2 instance. Using `/opt/` is a common convention for add-on software packages.
    - `sudo mkdir -p $APP_DIR`: Creates the application directory if it doesn't exist. `-p` ensures no error if the directory already exists and creates parent directories if needed. `sudo` is used because `/opt` is typically writable only by root.
    - `sudo chown ${{ secrets.EC2_USERNAME }}:${{ secrets.EC2_USERNAME }} $APP_DIR`: Changes the ownership of the application directory to the SSH user. This is crucial so that subsequent commands (like `git clone` and `pip install` within a venv) can be run as the SSH user without needing `sudo` for every file operation within `APP_DIR`.
    - `cd $APP_DIR`: Navigates into the application directory.
    - Git Clone/Update Logic:

          if [ ! -d ".git" ]; then
            git clone https://github.com/${{ github.repository }}.git .
          else
            git remote set-url origin https://github.com/${{ github.repository }}.git
            git fetch origin main --prune
            git reset --hard origin/main
            git clean -fdx
          fi

      - This block handles both the first-time deployment (cloning the repository) and subsequent updates.
      - `if [ ! -d ".git" ]`: Checks if it's a new deployment by looking for the `.git` directory.
      - `git clone ... .`: Clones into the current directory (`.`).
      - `git remote set-url ...`: Ensures the remote URL is correct (good for robustness).
      - `git fetch origin main --prune`: Fetches the latest changes from the `main` branch of the `origin` remote and removes any remote-tracking branches that no longer exist on the remote (`--prune`).
      - `git reset --hard origin/main`: Forces the local `main` branch to exactly match the state of `origin/main`, discarding any local changes or commits on the server. This ensures a clean and predictable state.
      - `git clean -fdx`: Removes all untracked files (`-f` for force, `-d` for directories, `-x` for ignored files too). This ensures that any artifacts from previous builds or runs are cleaned up, providing a pristine environment. Because it runs in the repository root, it also removes an untracked `venv` directory and earlier output files, so the virtual environment is normally rebuilt on each deployment.
      - Idempotency: This set of commands is designed to be idempotent, meaning running it multiple times will achieve the same result as running it once.
    - Python Virtual Environment Setup on EC2:

          if [ ! -d "venv" ]; then
            python3 -m venv venv
          fi
          source venv/bin/activate

      - Creates a virtual environment named `venv` if it doesn't already exist.
      - Activates the virtual environment. All subsequent `pip` and `python` commands will use this isolated environment.
    - Installing Runtime Dependencies on EC2:

          pip install --upgrade pip
          pip install -r requirements.txt

      - Upgrades pip within the virtual environment.
      - Installs only the runtime dependencies (`pandas`, `numpy`) specified in `requirements.txt`. This is efficient as it doesn't install the large set of development tools on the "production-like" EC2 instance.
    - Running the Data Processing Script:

          python src/main.py

      - Executes your main Python script. The `PROJECT_ROOT` logic within `src/main.py` will correctly resolve paths relative to `$APP_DIR/src/main.py` on the EC2 instance.
    - Informational `echo` statements: Provide feedback in the GitHub Actions log about the deployment process and where to find output files on the EC2 instance.
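
Putting the pieces of this step together, the following sketch shows one way the whole deploy step could look. It is not a copy of the project's file: the action version tag `v1.0.0`, the `|| 22` port fallback, and the `-e venv` exclusion on `git clean` are assumptions added for illustration, and the informational `echo` lines are omitted.

```yaml
# Sketch of the deploy step. Assumed: the version tag, the port fallback, and "-e venv".
- name: Deploy to EC2 and Run Script
  uses: appleboy/ssh-action@v1.0.0        # assumed tag; pin whichever release you have vetted
  with:
    host: ${{ secrets.EC2_HOST }}
    username: ${{ secrets.EC2_USERNAME }}
    key: ${{ secrets.EC2_SSH_PRIVATE_KEY }}
    port: ${{ secrets.EC2_PORT || 22 }}   # falls back to 22 when the secret is unset
    script: |
      set -e
      export APP_DIR="/opt/my_data_project_src_main"
      sudo mkdir -p "$APP_DIR"
      sudo chown ${{ secrets.EC2_USERNAME }}:${{ secrets.EC2_USERNAME }} "$APP_DIR"
      cd "$APP_DIR"
      if [ ! -d ".git" ]; then
        # First deployment: clone into the (empty) application directory.
        git clone https://github.com/${{ github.repository }}.git .
      else
        # Later deployments: force the checkout to match origin/main.
        git remote set-url origin https://github.com/${{ github.repository }}.git
        git fetch origin main --prune
        git reset --hard origin/main
        git clean -fdx -e venv            # assumed tweak: keep the existing virtual environment
      fi
      if [ ! -d "venv" ]; then
        python3 -m venv venv
      fi
      source venv/bin/activate
      pip install --upgrade pip
      pip install -r requirements.txt
      python src/main.py
```

The `-e venv` option tells `git clean` to leave the `venv` directory alone even though `-x` is set, so the environment is reused between deployments. Whether that is desirable is a judgment call: rebuilding the environment every time is slower but guarantees no stale packages remain.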
Security and Best Practices in the Workflow:
- GitHub Secrets: Correctly used for sensitive information like EC2 host, username, and SSH key.
- Principle of Least Privilege (Partial): While `sudo` is used for initial directory creation and ownership change, subsequent operations within `APP_DIR` are performed as the non-root SSH user. The SSH key used should ideally have restricted permissions on the EC2 instance if possible.
- Idempotent Deployment Script: The script on EC2 attempts to be idempotent, which is good for re-runnability.
- Dependency Pinning (Implicit):
  - For CI, `pip install -e .[dev]` installs dependencies based on `pyproject.toml`. If versions there are pinned or have tight ranges, the CI environment is reproducible. If they are loose, the CI environment might vary slightly over time.
  - For CD, `pip install -r requirements.txt` is used. The reproducibility here depends on how `requirements.txt` is managed. If it's generated with `pip freeze` from a known good environment, it pins versions. If manually curated with loose versions, the EC2 environment might vary.
- Fail Fast: `set -e` in the deployment script and the fact that GitHub Actions steps fail on non-zero exit codes help in identifying issues quickly.
Potential Improvements or Alternatives:
- EC2 Setup on Runner vs. SSH Action: For very complex EC2 interactions, one might consider using AWS-specific GitHub Actions (e.g., to use AWS CLI, SSM Run Command) rather than a generic SSH action, though `appleboy/ssh-action` is fine for this scale.
- Artifact-Based Deployment: Instead of `git clone` on the EC2 instance, the CI job could build a distributable artifact (like a Python wheel or a zip file containing the script and dependencies). This artifact would then be transferred to EC2 and deployed. This can be more robust and faster for deployment.
- Configuration Management for EC2: For managing the EC2 instance itself (ensuring Python, Git, etc., are installed), tools like Ansible, Chef, or Puppet, or using pre-baked AMIs (Amazon Machine Images) would be more robust solutions for production environments.
- Containerization (Docker): A very common approach is to containerize the Python application using Docker. The CI pipeline would build a Docker image, push it to a registry (like Docker Hub or AWS ECR), and the CD part would pull and run this image on EC2 (or a container orchestration service like ECS or EKS). This provides greater environment consistency.
- More Specific `runs-on`: Instead of `ubuntu-latest`, specifying a particular version like `ubuntu-22.04` can provide more stability over time, as `latest` can change and potentially introduce unexpected behavior.
- Safety Command Update: As noted in the logs, update `safety check` to `safety scan` or the current recommended command by the Safety tool.
- Workflow Reusability: For more complex projects with multiple similar workflows, GitHub Actions offers "Reusable Workflows" or "Composite Actions" to avoid duplication.
- Environment Variables for `APP_DIR`: Instead of `export APP_DIR` in the script, you could potentially set it as an environment variable for the `ssh-action` if the action supports it, or pass it as an argument if the script was designed to take it.
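
As a rough illustration of two of these ideas, the pinned runner image and the artifact-based deployment, the sketch below adds a job that builds a wheel and stores it as a workflow artifact. Everything in it is hypothetical: the job name `build-artifact`, the choice of Python 3.11, and the artifact name `dist` are placeholders, and it assumes `pyproject.toml` declares a build backend so that `python -m build` can produce a wheel.

```yaml
# Hypothetical job illustrating a pinned runner and an artifact-based hand-off.
jobs:
  build-artifact:                 # hypothetical job name
    needs: lint-test-analyze      # build only after the CI checks pass
    runs-on: ubuntu-22.04         # pinned image instead of ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"  # arbitrary choice for the build interpreter
      - name: Build wheel
        run: |
          python -m pip install --upgrade build
          python -m build --wheel        # writes the wheel into dist/
      - name: Upload wheel for the deploy job
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: dist/
```

A deploy job could then fetch the wheel with `actions/download-artifact@v4` and copy it to the EC2 instance, so the server installs exactly what CI tested instead of cloning the repository.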
This detailed documentation should clarify the purpose, design, and operation of your `ci-cd.yml` workflow file. It's a solid foundation for automating your project's lifecycle.