# Streamlit Docker & AWS Implementation Plan for rutgersgrid
See: Link to EmTech Cloud Repo
## Overview
This implementation plan outlines how to containerize Streamlit applications from the rutgersgrid GitHub organization and deploy them to AWS in a cost-optimized manner. The approach follows a multi-app container architecture that bundles applications for efficient resource utilization.
## Implementation Phases
### Phase 1: Docker Configuration (Week 1)
#### 1.1. Template Repository Updates
First, update the template Streamlit repository with Docker support:
- Add the following files to your template repository:
  - `Dockerfile`
  - `.dockerignore`
  - `docker-compose.yml` (for local development)
- Create a standard `Dockerfile` in the template:
```dockerfile
# Base image
FROM python:3.10-slim

# Set working directory
WORKDIR /app

# Install curl for the health check (not included in the slim base image)
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose Streamlit port
EXPOSE 8501

# Health check
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost:8501/_stcore/health || exit 1

# Run the application
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
```
- Create a `.dockerignore` file:
```
.git
.github
.gitignore
__pycache__/
*.py[cod]
*$py.class
.env
.env.local
venv/
.venv/
ENV/
```
- Create `docker-compose.yml` for local testing:
```yaml
version: '3'
services:
  streamlit:
    build: .
    ports:
      - "8501:8501"
    volumes:
      - .:/app
    environment:
      - STREAMLIT_SERVER_PORT=8501
      - STREAMLIT_SERVER_ADDRESS=0.0.0.0
```
- Update the README with Docker usage instructions.
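For example, the README's Docker section could include something like the following (the image name `streamlit-app` is only illustrative):

```bash
# Build the image from the repository root
docker build -t streamlit-app .

# Run the app and expose the Streamlit port
docker run --rm -p 8501:8501 streamlit-app

# Or use docker compose for local development (mounts the source as a volume)
docker compose up
```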
#### 1.2. Create Infrastructure Repository
Create a new repository named `rutgersgrid-infrastructure` to manage AWS resources and deployment:
- Initialize with the following structure:
```
rutgersgrid-infrastructure/
├── .github/
│   └── workflows/
│       ├── deploy.yml
│       └── cost-report.yml
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── ecs.tf
│   ├── networking.tf
│   ├── alb.tf
│   └── security.tf
├── docker/
│   ├── Dockerfile
│   ├── nginx.conf
│   └── start.sh
├── scripts/
│   ├── bundle-apps.sh
│   └── generate-cost-report.py
└── README.md
```
- Create the multi-app Dockerfile in `docker/Dockerfile`:
```dockerfile
FROM python:3.10-slim

# Install Nginx
RUN apt-get update && apt-get install -y nginx curl && rm -rf /var/lib/apt/lists/*

# Install common Python packages
RUN pip install --no-cache-dir \
    streamlit==1.32.0 \
    pandas==2.2.0 \
    numpy==1.26.3 \
    matplotlib==3.8.2 \
    boto3==1.34.34

# Create app directories and copy in the bundled app repositories
# (the deploy workflow clones them into apps/ before the build)
RUN mkdir -p /apps
COPY apps/ /apps/

# Copy Nginx configuration
COPY docker/nginx.conf /etc/nginx/nginx.conf

# Copy startup script
COPY docker/start.sh /start.sh
RUN chmod +x /start.sh

# Expose port for Nginx
EXPOSE 80

# Add a healthcheck
HEALTHCHECK --interval=30s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost/healthz || exit 1

CMD ["/start.sh"]
```
- Create the Nginx configuration in `docker/nginx.conf`:
```nginx
worker_processes 1;

events {
    worker_connections 1024;
}

http {
    server {
        listen 80;

        # Health check endpoint
        location /healthz {
            access_log off;
            return 200 'OK';
        }

        location / {
            return 302 /app1/;
        }

        # App locations will be dynamically generated
        # during the bundle process
    }
}
```
- Create the startup script in `docker/start.sh`:
```bash
#!/bin/bash
# Start all Streamlit instances in background
# This will be dynamically populated during build

# Start Nginx in foreground
nginx -g "daemon off;"
```
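For reference, after the bundle step injects the startup commands (see the deploy workflow in Phase 2.2), the generated script might look like the following. The app names and ports here are purely illustrative:

```bash
#!/bin/bash
# Start all Streamlit instances in background
# This will be dynamically populated during build
cd /apps/streamlit-example-one && streamlit run app.py --server.port=8501 --server.baseUrlPath=/streamlit-example-one --server.enableCORS=false --server.enableXsrfProtection=false &
cd /apps/streamlit-example-two && streamlit run app.py --server.port=8502 --server.baseUrlPath=/streamlit-example-two --server.enableCORS=false --server.enableXsrfProtection=false &

# Start Nginx in foreground
nginx -g "daemon off;"
```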
### Phase 2: AWS Infrastructure Setup (Week 2)
#### 2.1. Define Terraform Configuration
Create the following Terraform files to provision AWS infrastructure:
`terraform/main.tf`:
```hcl
provider "aws" {
  region = var.aws_region
}

# Store Terraform state in S3
terraform {
  backend "s3" {
    bucket = "rutgersgrid-terraform-state"
    key    = "streamlit-apps/terraform.tfstate"
    region = "us-east-1"
  }
}

# Resource tagging module
module "tags" {
  source      = "./modules/tags"
  project     = "rutgersgrid"
  environment = var.environment
}
```
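`main.tf` references a local `./modules/tags` module that is not spelled out in this plan. A minimal sketch of what it might contain, assuming the inputs above and the `module.tags.common_tags` output used throughout `ecs.tf`, could be:

```hcl
# terraform/modules/tags/main.tf (hypothetical sketch)
variable "project" {
  type = string
}

variable "environment" {
  type = string
}

output "common_tags" {
  value = {
    Project     = var.project      # matches the "Project" tag filtered on in the cost report
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}
```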
`terraform/ecs.tf`:
```hcl
# ECS Cluster
resource "aws_ecs_cluster" "streamlit_cluster" {
  name = "rutgersgrid-streamlit-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }

  tags = module.tags.common_tags
}

# ECS Task Execution Role
resource "aws_iam_role" "ecs_task_execution_role" {
  name = "rutgersgrid-streamlit-execution-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          Service = "ecs-tasks.amazonaws.com"
        }
      }
    ]
  })

  tags = module.tags.common_tags
}

# Attach policies to the execution role
resource "aws_iam_role_policy_attachment" "ecs_task_execution_role_policy" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# ECS Task Definition
resource "aws_ecs_task_definition" "streamlit_bundle" {
  family                   = "rutgersgrid-streamlit-bundle"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn

  container_definitions = jsonencode([
    {
      name      = "streamlit-bundle"
      image     = "${aws_ecr_repository.streamlit_apps.repository_url}:latest"
      essential = true
      portMappings = [
        {
          containerPort = 80
          hostPort      = 80
          protocol      = "tcp"
        }
      ]
      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.streamlit_logs.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = "streamlit"
        }
      }
    }
  ])

  tags = module.tags.common_tags
}

# ECS Service
resource "aws_ecs_service" "streamlit_service" {
  name            = "rutgersgrid-streamlit-service"
  cluster         = aws_ecs_cluster.streamlit_cluster.id
  task_definition = aws_ecs_task_definition.streamlit_bundle.arn
  launch_type     = "FARGATE"
  desired_count   = var.service_desired_count

  network_configuration {
    subnets          = aws_subnet.public[*].id
    security_groups  = [aws_security_group.ecs_sg.id]
    assign_public_ip = true
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.streamlit_tg.arn
    container_name   = "streamlit-bundle"
    container_port   = 80
  }

  tags = module.tags.common_tags

  depends_on = [aws_lb_listener.http]
}

# Auto-scaling configuration
resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = var.max_capacity
  min_capacity       = var.min_capacity
  resource_id        = "service/${aws_ecs_cluster.streamlit_cluster.name}/${aws_ecs_service.streamlit_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Scale based on CPU utilization
resource "aws_appautoscaling_policy" "ecs_policy_cpu" {
  name               = "cpu-auto-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value = 70.0
  }
}

# Scale during business hours
resource "aws_appautoscaling_scheduled_action" "scale_up_morning" {
  name               = "scale-up-morning"
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(0 8 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 1
    max_capacity = var.max_capacity
  }
}

# Scale down after hours
resource "aws_appautoscaling_scheduled_action" "scale_down_evening" {
  name               = "scale-down-evening"
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(0 18 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# CloudWatch Logs
resource "aws_cloudwatch_log_group" "streamlit_logs" {
  name              = "/ecs/rutgersgrid-streamlit"
  retention_in_days = 30

  tags = module.tags.common_tags
}

# ECR Repository
resource "aws_ecr_repository" "streamlit_apps" {
  name                 = "rutgersgrid-streamlit-apps"
  image_tag_mutability = "MUTABLE"

  image_scanning_configuration {
    scan_on_push = true
  }

  tags = module.tags.common_tags
}

# ECR Lifecycle Policy
resource "aws_ecr_lifecycle_policy" "streamlit_apps_lifecycle" {
  repository = aws_ecr_repository.streamlit_apps.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = "Keep only the 10 most recent images"
        selection = {
          tagStatus   = "any"
          countType   = "imageCountMoreThan"
          countNumber = 10
        }
        action = {
          type = "expire"
        }
      }
    ]
  })
}
```
- Create additional Terraform files for networking, ALB, security groups, etc.
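The ECS service above references `aws_lb_target_group.streamlit_tg` and `aws_lb_listener.http`, which would live in `terraform/alb.tf`. A hedged sketch is shown below; the `aws_security_group.alb_sg`, `aws_vpc.main`, and `aws_subnet.public` names are assumptions about what `networking.tf` and `security.tf` would define:

```hcl
# terraform/alb.tf (illustrative sketch)
resource "aws_lb" "streamlit_alb" {
  name               = "rutgersgrid-streamlit-alb"   # name used by the deploy workflow's URL lookup
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb_sg.id]
  subnets            = aws_subnet.public[*].id

  tags = module.tags.common_tags
}

resource "aws_lb_target_group" "streamlit_tg" {
  name        = "rutgersgrid-streamlit-tg"
  port        = 80
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "ip" # required for Fargate tasks in awsvpc mode

  health_check {
    path    = "/healthz" # matches the Nginx health endpoint
    matcher = "200"
  }

  tags = module.tags.common_tags
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.streamlit_alb.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.streamlit_tg.arn
  }
}
```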
#### 2.2. GitHub Actions Workflows
Create GitHub Actions workflow files:
`.github/workflows/deploy.yml`:
```yaml
name: Deploy Streamlit Apps

on:
  push:
    branches: [ main ]
    paths-ignore:
      - '*.md'
      - 'docs/**'
  # Re-deploy when an app repository sends the "app-updated" dispatch (see Phase 3.2)
  repository_dispatch:
    types: [app-updated]
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  discover-apps:
    runs-on: ubuntu-latest
    outputs:
      app_list: ${{ steps.find-apps.outputs.app_list }}
    steps:
      - name: Checkout Infrastructure
        uses: actions/checkout@v3
        with:
          fetch-depth: 1

      - name: Find Streamlit Apps
        id: find-apps
        run: |
          # Get list of Streamlit app repositories from rutgersgrid organization
          APPS=$(gh api orgs/rutgersgrid/repos --jq '[.[] | select(.name | startswith("streamlit-")) | .name]')
          echo "app_list=$APPS" >> $GITHUB_OUTPUT
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  build-and-deploy:
    runs-on: ubuntu-latest
    needs: discover-apps
    steps:
      - name: Checkout Infrastructure
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Clone App Repositories
        run: |
          mkdir -p apps
          APP_LIST='${{ needs.discover-apps.outputs.app_list }}'
          for app in $(echo $APP_LIST | jq -r '.[]'); do
            echo "Cloning $app"
            git clone https://github.com/rutgersgrid/$app.git apps/$app
          done

      - name: Generate App Configs
        run: |
          # Generate Nginx config entries
          PORT=8501
          NGINX_LOCATIONS=""
          START_COMMANDS=""
          for app_dir in apps/*; do
            if [ -d "$app_dir" ]; then
              app_name=$(basename $app_dir)
              echo "Configuring $app_name on port $PORT"

              # Add to Nginx config
              NGINX_LOCATIONS+="
          location /$app_name/ {
              proxy_pass http://localhost:$PORT/;
              proxy_http_version 1.1;
              proxy_set_header Upgrade \$http_upgrade;
              proxy_set_header Connection \"upgrade\";
              proxy_set_header Host \$host;
              proxy_cache_bypass \$http_upgrade;
          }"

              # Add to startup commands
              START_COMMANDS+="cd /apps/$app_name && streamlit run app.py --server.port=$PORT --server.baseUrlPath=/$app_name --server.enableCORS=false --server.enableXsrfProtection=false &\n"

              # Increment port for next app
              PORT=$((PORT+1))
            fi
          done

          # Update Nginx config
          sed -i "/# App locations will be dynamically generated/a\\$NGINX_LOCATIONS" docker/nginx.conf

          # Update startup script
          sed -i "/# This will be dynamically populated during build/a\\$START_COMMANDS" docker/start.sh

      - name: Build Multi-App Docker Image
        run: |
          docker build \
            -t ${{ steps.login-ecr.outputs.registry }}/rutgersgrid-streamlit-apps:latest \
            -t ${{ steps.login-ecr.outputs.registry }}/rutgersgrid-streamlit-apps:${{ github.sha }} \
            -f docker/Dockerfile .

      - name: Push Docker Image
        run: |
          docker push ${{ steps.login-ecr.outputs.registry }}/rutgersgrid-streamlit-apps:latest
          docker push ${{ steps.login-ecr.outputs.registry }}/rutgersgrid-streamlit-apps:${{ github.sha }}

      - name: Deploy to ECS
        run: |
          aws ecs update-service \
            --cluster rutgersgrid-streamlit-cluster \
            --service rutgersgrid-streamlit-service \
            --force-new-deployment

      - name: Wait for Deployment
        run: |
          aws ecs wait services-stable \
            --cluster rutgersgrid-streamlit-cluster \
            --services rutgersgrid-streamlit-service

      - name: Output Service URL
        run: |
          ALB_DNS=$(aws elbv2 describe-load-balancers \
            --names rutgersgrid-streamlit-alb \
            --query 'LoadBalancers[0].DNSName' \
            --output text)
          echo "Streamlit applications deployed to: http://$ALB_DNS/"
```
`.github/workflows/cost-report.yml`:
```yaml
name: Generate Cost Report

on:
  schedule:
    - cron: '0 0 * * MON' # Run every Monday
  workflow_dispatch:

permissions:
  id-token: write
  contents: read

jobs:
  cost-analysis:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Generate Cost Report
        run: |
          # The report script imports boto3 and tabulate, which are not on the runner by default
          pip install boto3 tabulate
          python scripts/generate-cost-report.py > cost-report.md

      - name: Upload Cost Report
        uses: actions/upload-artifact@v3
        with:
          name: cost-report
          path: cost-report.md
```
### Phase 3: CI/CD Integration (Week 3)
#### 3.1. Create App Registration System
Develop a system to register Streamlit apps for deployment:
- Add a `release-metadata.json` file to each Streamlit app repository:
```json
{
  "name": "app-name",
  "type": "streamlit",
  "description": "Description of the app",
  "route": "/app-name/",
  "version": "1.0.0",
  "tags": ["tag1", "tag2"]
}
```
- Create a script to gather this metadata during deployment.
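A minimal sketch of such a script is shown below. It assumes the cloned apps live under `apps/` (as in the deploy workflow) and writes a combined `app-manifest.json`; both of those names are illustrative, not part of the plan:

```python
#!/usr/bin/env python3
"""Collect release-metadata.json files from cloned app repositories.

Illustrative sketch only: the apps/ layout and the manifest output path
are assumptions made for this example.
"""
import json
from pathlib import Path


def gather_metadata(apps_dir: str = "apps", output: str = "app-manifest.json") -> list[dict]:
    """Read each apps/<repo>/release-metadata.json and write a combined manifest."""
    manifest = []
    for metadata_file in sorted(Path(apps_dir).glob("*/release-metadata.json")):
        with metadata_file.open() as f:
            entry = json.load(f)
        # Record which repository the metadata came from
        entry["repository"] = metadata_file.parent.name
        manifest.append(entry)

    with open(output, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


if __name__ == "__main__":
    apps = gather_metadata()
    print(f"Collected metadata for {len(apps)} app(s)")
```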
#### 3.2. Update Template Repository with CI/CD
Add GitHub Actions workflow to the template repository:
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install pytest

      - name: Run tests
        run: |
          pytest

  build:
    needs: test
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Build Docker image
        run: |
          docker build -t streamlit-app:${{ github.sha }} .

      - name: Test Docker image
        run: |
          docker run -d -p 8501:8501 --name test-app streamlit-app:${{ github.sha }}
          sleep 10
          # -f makes the step fail if the health endpoint does not return success
          curl -sf http://localhost:8501/_stcore/health

      - name: Trigger infrastructure update
        uses: peter-evans/repository-dispatch@v2
        with:
          token: ${{ secrets.REPO_ACCESS_TOKEN }}
          repository: rutgersgrid/rutgersgrid-infrastructure
          event-type: app-updated
          client-payload: '{"app_name": "${{ github.repository }}", "commit_sha": "${{ github.sha }}"}'
```
#### 3.3. Cost Optimization Script
Create a script to analyze costs and provide optimization recommendations:
```python
#!/usr/bin/env python3
"""
AWS Cost Analysis Script for rutgersgrid Streamlit Applications
"""
import boto3
import datetime
import json
from tabulate import tabulate


def get_date_range():
    """Get date range for the past 30 days"""
    end_date = datetime.datetime.now()
    start_date = end_date - datetime.timedelta(days=30)
    return start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')


def analyze_costs():
    """Analyze AWS costs for the past 30 days"""
    client = boto3.client('ce')
    start_date, end_date = get_date_range()

    # Get costs grouped by service
    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='MONTHLY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {
                'Type': 'DIMENSION',
                'Key': 'SERVICE'
            }
        ],
        Filter={
            'Tags': {
                'Key': 'Project',
                'Values': ['rutgersgrid']
            }
        }
    )

    # Process results
    cost_by_service = []
    total_cost = 0.0
    for result in response['ResultsByTime'][0]['Groups']:
        service = result['Keys'][0]
        amount = float(result['Metrics']['UnblendedCost']['Amount'])
        total_cost += amount
        cost_by_service.append([service, f"${amount:.2f}"])

    # Sort by cost (descending)
    cost_by_service.sort(key=lambda x: float(x[1].replace('$', '')), reverse=True)

    # Add total
    cost_by_service.append(['TOTAL', f"${total_cost:.2f}"])

    # Generate markdown report
    print(f"# Rutgersgrid Streamlit Cost Report\n")
    print(f"## Cost Analysis ({start_date} to {end_date})\n")
    print(tabulate(cost_by_service, headers=['Service', 'Cost'], tablefmt='pipe'))

    # Get daily trend
    daily_response = client.get_cost_and_usage(
        TimePeriod={
            'Start': start_date,
            'End': end_date
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        Filter={
            'Tags': {
                'Key': 'Project',
                'Values': ['rutgersgrid']
            }
        }
    )

    # Generate daily cost data for trends
    days = []
    costs = []
    for result in daily_response['ResultsByTime']:
        day = result['TimePeriod']['Start']
        cost = float(result['Total']['UnblendedCost']['Amount'])
        days.append(day)
        costs.append(cost)

    print("\n## Daily Cost Trend\n")
    # Render the daily costs as a table so the trend is visible in the report
    print(tabulate([[d, f"${c:.2f}"] for d, c in zip(days, costs)],
                   headers=['Date', 'Cost'], tablefmt='pipe'))

    # Calculate monthly forecast
    days_in_month = 30
    days_passed = len(days)
    forecast = total_cost * (days_in_month / days_passed)
    print(f"\n## Monthly Forecast\n")
    print(f"Projected monthly cost: ${forecast:.2f}\n")

    # Optimization recommendations
    print("\n## Optimization Recommendations\n")
    if total_cost > 100:
        print("- Consider implementing application bundling to reduce container costs")
        print("- Verify the auto-scaling settings to ensure instances scale to zero after hours")
        print("- Review CloudWatch logs and metrics for optimization opportunities")
        print("- Consider using Fargate Spot for non-critical workloads")


if __name__ == "__main__":
    analyze_costs()
```
### Phase 4: Testing and Deployment (Week 4)
#### 4.1. Local Testing
Test the containerization locally (example commands follow this list):
- Build individual app containers
- Test the multi-app container with Nginx routing
- Verify the auto-scaling configuration
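For example (image, container, and app-route names here are illustrative, not prescribed by the plan):

```bash
# Build and smoke-test a single app image from the template repository
docker build -t streamlit-app:dev .
docker run -d --name app-test -p 8501:8501 streamlit-app:dev
curl -f http://localhost:8501/_stcore/health
docker rm -f app-test

# Build the multi-app bundle from the infrastructure repository
# (assumes apps/ has already been populated, e.g. with scripts/bundle-apps.sh)
docker build -f docker/Dockerfile -t streamlit-bundle:dev .
docker run -d --name bundle-test -p 8080:80 streamlit-bundle:dev
curl -f http://localhost:8080/healthz                 # Nginx health endpoint
curl -I http://localhost:8080/streamlit-example-one/  # hypothetical app route
docker rm -f bundle-test
```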
#### 4.2. AWS Deployment
Deploy to AWS:
- Create AWS resources using Terraform (see the example commands after this list)
- Deploy the initial container
- Set up monitoring and alerting
- Test full deployment pipeline
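The Terraform provisioning step would typically look like this, run from the `terraform/` directory (the `environment` value is only an example of the variable declared in `variables.tf`, and the S3 state bucket from `main.tf` must already exist):

```bash
cd terraform

# Initialize the S3 backend and download providers
terraform init

# Review the planned changes for the target environment
terraform plan -var="environment=prod" -out=tfplan

# Apply the reviewed plan
terraform apply tfplan
```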
#### 4.3. Documentation
Update documentation:
- Add Docker usage instructions to each repository
- Create deployment guide
- Document cost optimization strategy
- Create troubleshooting guide
## Cost Optimization Measures
Based on your cost analysis documents, implement these optimizations:
- **Application Bundling**: Bundle multiple Streamlit apps in a single container
- **Auto-Scaling**: Scale to zero during off-hours (evenings and weekends)
- **Resource Sizing**: Start with minimal resources (0.25 vCPU, 0.5 GB RAM)
- **Use Fargate Spot**: For non-critical workloads, saving roughly 70% (see the sketch after this list)
- **Shared Resources**: Single ALB for all applications
- **CloudFront Integration**: For static asset caching
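If Fargate Spot is adopted, the ECS service defined in `ecs.tf` would swap its `launch_type` for a capacity provider strategy along these lines. This is a hedged sketch of the change, not part of the Terraform shown above:

```hcl
# Register both Fargate capacity providers on the existing cluster
resource "aws_ecs_cluster_capacity_providers" "streamlit" {
  cluster_name       = aws_ecs_cluster.streamlit_cluster.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]
}

# Inside aws_ecs_service.streamlit_service, replace `launch_type = "FARGATE"`
# with a strategy that prefers Spot but keeps some on-demand capacity:
#
#   capacity_provider_strategy {
#     capacity_provider = "FARGATE_SPOT"
#     weight            = 3
#   }
#
#   capacity_provider_strategy {
#     capacity_provider = "FARGATE"
#     weight            = 1
#   }
```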
## Estimated AWS Costs
Based on your "Annual Cost Analysis" document:

| Scale | Monthly Cost | Annual Cost |
|---|---|---|
| 1-5 Apps | $28-40 | $336-480 |
| 10 Apps | ~$89 | ~$1,070 |
| Fully Optimized (10 Apps) | ~$52 | ~$624 |
Cost optimization can provide savings of 30-66% depending on scale.
## Next Steps
After initial implementation:
- **Monitor Usage**: Track application usage patterns
- **Refine Scaling**: Adjust auto-scaling based on actual usage
- **Cost Tracking**: Implement detailed cost allocation tagging
- **Performance Optimization**: Monitor and optimize performance
## Streamlit Docker & AWS Implementation - User Stories
### Epic 1: Docker Configuration and Containerization
- As a developer, I want to have a standardized Docker template for Streamlit applications, so that I can consistently containerize any Streamlit app in the VIDA project.
- As a developer, I want to create a multi-app Docker container architecture with Nginx, so that I can efficiently bundle multiple Streamlit applications into a single deployment.
- As a DevOps engineer, I want to set up a central infrastructure repository, so that all deployment configurations and scripts are maintained in one place.
### Epic 2: AWS Infrastructure Setup
- As a cloud architect, I want to define AWS infrastructure using Terraform, so that cloud resources can be provisioned in a controlled, version-controlled manner.
- As a system administrator, I want to configure auto-scaling for ECS services, so that resources scale based on demand and reduce costs during off-hours.
- As a DevOps engineer, I want to set up a secure ECS cluster with appropriate IAM roles, so that the container environment follows security best practices.
- As a system administrator, I want to configure a load balancer for the containerized applications, so that traffic is properly distributed and the system is resilient.
### Epic 3: CI/CD Integration
- As a developer, I want to implement an app registration system, so that new Streamlit applications can be easily added to the deployment pipeline.
- As a DevOps engineer, I want to create GitHub Actions workflows for automated deployment, so that changes to applications trigger appropriate infrastructure updates.
- As a project manager, I want to implement a cost analysis and reporting system, so that AWS expenses are tracked and optimization opportunities are identified.
- As a developer, I want to update the template repository with CI/CD configuration, so that new applications automatically integrate with the deployment pipeline.
### Epic 4: Testing and Deployment
- As a QA engineer, I want to test the container architecture locally, so that issues can be identified before cloud deployment.
- As a DevOps engineer, I want to deploy the complete solution to AWS, so that the infrastructure is ready for production use.
- As a technical writer, I want to create comprehensive documentation for the system, so that team members can understand, use, and maintain the deployment infrastructure.
- As a system administrator, I want to set up monitoring and alerting, so that issues can be quickly identified and addressed.
### Epic 5: Cost Optimization
- As a finance manager, I want to implement recommended cost optimization measures, so that cloud expenses are minimized without compromising performance.
- As a system administrator, I want to implement CloudFront for static asset caching, so that delivery of static content is optimized for performance and cost.
- As a DevOps engineer, I want to set up detailed cost allocation tagging, so that expenses can be tracked at a granular level.