AWS Infrastructure Port — Phase 46 Design Specification

MDC MCP RAG Server: Docker → AWS-Native Migration

Date: March 20, 2026 | Branch: develop_aws | Spec: .kiro/specs/aws-infrastructure-port/


Executive Summary

This document consolidates the full design specification for porting the MDC MCP RAG Server from its legacy Docker-based infrastructure on NOAA Parallel Works VMs to AWS-native services. The system provides 51 MCP tools across 9 modules for NOAA Global Workflow AI assistance, backed by ChromaDB (~81K documents, 5 collections) and Neo4j (~95K nodes, ~2.6M relationships).

The migration replaces each Docker-hosted component with an AWS-native counterpart:

| Legacy Component | AWS Replacement |
| --- | --- |
| Neo4j (Docker) | Amazon Neptune (openCypher) |
| ChromaDB (Docker) | Amazon OpenSearch (k-NN vector search) |
| Docker MCP Gateway | ECS Fargate + API Gateway |
| Docker Compose | AWS CDK (TypeScript IaC) |
| Manual TLS/Auth | CloudFront + WAF + Cognito OAuth 2.0 |

All new infrastructure uses mdc-mcp-rag naming per the EIB → MDC institutional rename. The persistent data root shifts from /mcp_rag_eib to /mdc-mcp-rag.



Architecture

Target AWS Service Topology

┌────────────────────────────────────────────────────────────────┐
│  VPC: mdc-mcp-rag-vpc                                          │
│                                                                │
│  ┌─── Public Subnets ───┐    ┌─── Private Subnets ──────────┐  │
│  │  ALB                 │    │  ECS Fargate (MCP Server)    │  │
│  │  NAT Gateway         │    │  51 tools, 9 modules         │  │
│  └──────────────────────┘    │                              │  │
│                              │  Amazon Neptune              │  │
│  ┌─── VPC Endpoints ────┐    │  openCypher, ~95K nodes      │  │
│  │  Secrets Manager     │    │                              │  │
│  │  SSM Parameter Store │    │  Amazon OpenSearch           │  │
│  │  CloudWatch          │    │  k-NN, ~81K docs, 768-dim    │  │
│  └──────────────────────┘    └──────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
         ▲                              ▲
         │                              │
  CloudFront + WAF              Amazon EFS (/mdc-mcp-rag)
         ▲
         │
  Amazon Cognito (OAuth 2.0)

Legacy → AWS Component Mapping

| Legacy Component | Provisioning Script | AWS Replacement | Phase |
| --- | --- | --- | --- |
| Directory structure (/mcp_rag_eib) | 01-directories.sh | EFS mount at /mdc-mcp-rag | 46A |
| System dependencies | 02-system-deps.sh | Amazon Linux 2 AMI + Dockerfile | 46A |
| Docker engine | 03-docker.sh | Not needed (ECS Fargate) | Eliminated |
| Node.js runtime | 04-nodejs.sh | Dockerfile base image | 46B |
| Python & Spack | 05-python-spack.sh | Python in container / pip | 46D |
| ChromaDB (Docker) | 06-chromadb.sh | Amazon OpenSearch | 46C |
| MCP Server (Node.js) | 07-mcp-server.sh | ECS Fargate service | 46B |
| Neo4j + n8n (Docker Compose) | 08-services.sh | Amazon Neptune | 46C |
| Desktop VNC | 09-desktop-vnc.sh | Not needed (Kiro IDE) | Eliminated |
| Health checks | 10-verification.sh | CloudWatch + custom health endpoint | 46E |
| Docker MCP Gateway | 11-docker-mcp-gateway.sh | API Gateway (Streamable HTTP) | 46B |
| Static mode gateway | 12-static-mode-gateway.sh | CloudFront + ALB routing | 46B |
| Container cleanup | 13-container-cleanup.sh | ECS task lifecycle (automatic) | Eliminated |
| File permissions | 14-final-ownership.sh | IAM task roles + EFS POSIX | 46A |
| GitHub Copilot CLI | 15-github-copilot-cli.sh | Not needed (Kiro IDE) | Eliminated |

Requirements Summary

17 Requirements Across 6 Domains

Database Abstraction (R1–R2)

  • Adapter interfaces (VectorDatabaseAdapter, GraphDatabaseAdapter) abstract all 51 tools from backend specifics
  • APOC procedure calls transparently replaced with openCypher equivalents for Neptune compatibility
  • Backend selection via DB_BACKEND env var or SSM parameter

Tool Interface Preservation (R3)

  • All 51 tools across 9 modules expose identical input schemas and output formats on AWS
  • Streamable HTTP transport via API Gateway replaces Docker MCP Gateway on port 18888

Data Migration & Fidelity (R4–R6)

  • Complete graph + vector data migration with count parity verification
  • 768-dim MPNet embeddings preserved exactly (no re-embedding)
  • Search relevance within 5% tolerance (epsilon = 0.05) between OpenSearch and ChromaDB

Infrastructure as Code (R7–R10)

  • Four CDK stacks: MdcVpcStack, MdcDataStack, MdcSecurityStack, MdcServerStack
  • Secrets in AWS Secrets Manager, config in SSM Parameter Store
  • ECS Fargate with auto-scaling, ALB health checks
  • CloudFront + WAF + Cognito OAuth 2.0

Security & Resilience (R11–R14)

  • All databases in private subnets, VPC endpoints for AWS services
  • KMS encryption at rest, TLS 1.2+ in transit
  • Graceful degradation: graph-dependent tools degrade independently of vector-search tools
  • Exponential backoff retries (5s, 10s, 20s, max 60s)
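
The retry schedule above can be sketched as a doubling delay with a cap. This is a minimal illustration, not the server's implementation: the function names are hypothetical, and the 40-second step between 20s and the 60s cap is inferred from the doubling pattern rather than stated in the spec.

```typescript
const BASE_DELAY_S = 5;
const MAX_DELAY_S = 60;

// Delay before retry number `attempt` (0-based): 5, 10, 20, 40, then capped at 60.
function backoffDelaySeconds(attempt: number): number {
  if (!Number.isInteger(attempt) || attempt < 0) {
    throw new RangeError(`attempt must be a non-negative integer, got ${attempt}`);
  }
  return Math.min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S);
}

// Hypothetical helper: runs `operation` up to `maxAttempts` times, sleeping
// between failures. `sleep` is injected so callers (and tests) control time.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  sleep: (seconds: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // No sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelaySeconds(attempt));
    }
  }
  throw lastError;
}
```

Injecting `sleep` keeps the schedule itself a pure function, which is the part property P13 would check.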

Operations (R15–R17)

  • 7 ingestion scripts adapted for Neptune + OpenSearch
  • Phased rollout with legacy coexistence during migration
  • 5 OpenSearch indices mapped from 5 ChromaDB collections

Design Components

Component 1: Database Adapter Layer

The adapter layer is the critical migration seam: it abstracts database access so that tool modules work identically against the legacy and AWS backends.

VectorDatabaseAdapter interface:

  • connect(), query(), multiCollectionQuery(), addDocuments()
  • listCollections(), getCollectionCount(), healthCheck(), close()

GraphDatabaseAdapter interface:

  • connect(), query(), findCallers(), traceCallChain()
  • getStatistics(), healthCheck(), close()
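
As a rough sketch, the two interfaces might look like this in TypeScript. Only the method names come from the spec; every parameter list, return type, and helper type here is an assumption made for illustration, and the server itself is JavaScript.

```typescript
// Result and status shapes are assumptions; the spec names only the methods.
interface VectorHit {
  id: string;
  content: string;
  score: number; // assumed normalized to [0, 1]
  metadata: Record<string, unknown>;
}

interface HealthStatus {
  healthy: boolean;
  detail?: string;
}

interface VectorDatabaseAdapter {
  connect(): Promise<void>;
  query(collection: string, text: string, k: number): Promise<VectorHit[]>;
  multiCollectionQuery(collections: string[], text: string, k: number): Promise<VectorHit[]>;
  addDocuments(collection: string, docs: VectorHit[]): Promise<void>;
  listCollections(): Promise<string[]>;
  getCollectionCount(collection: string): Promise<number>;
  healthCheck(): Promise<HealthStatus>;
  close(): Promise<void>;
}

interface GraphDatabaseAdapter {
  connect(): Promise<void>;
  query(cypher: string, params?: Record<string, unknown>): Promise<Record<string, unknown>[]>;
  findCallers(symbol: string): Promise<string[]>;
  traceCallChain(symbol: string, maxDepth: number): Promise<string[][]>;
  getStatistics(): Promise<{ nodes: number; relationships: number }>;
  healthCheck(): Promise<HealthStatus>;
  close(): Promise<void>;
}

// Hypothetical tool-side helper: it depends only on the interface, so either
// a legacy or an AWS adapter can be injected without changing this code.
async function countAllDocuments(db: VectorDatabaseAdapter): Promise<number> {
  const names = await db.listCollections();
  const counts = await Promise.all(names.map((name) => db.getCollectionCount(name)));
  return counts.reduce((sum, count) => sum + count, 0);
}
```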

Implementations:

  • OpenSearchAdapter — AWS backend (k-NN search, AWS SigV4 auth)
  • ChromaDBLegacyAdapter — wraps existing VectorDatabase.js
  • NeptuneAdapter — AWS backend (openCypher, IAM auth, APOC transformation)
  • Neo4jLegacyAdapter — wraps existing GraphDatabase.js

Component 2: APOC Transformation Engine

Neptune does not support APOC procedures. The transformation engine transparently rewrites queries:

| APOC Procedure | openCypher Replacement |
| --- | --- |
| apoc.path.expand | Variable-length path patterns |
| apoc.algo.dijkstra | Neptune shortest path / Gremlin |
| apoc.periodic.iterate | Batched UNWIND queries |
| apoc.create.node | Standard CREATE |
| apoc.merge.node | MERGE with ON CREATE SET / ON MATCH SET |

Unknown APOC procedures throw UnsupportedQueryError.
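
To make the first row of the table concrete, here is a minimal sketch of generating a variable-length path pattern of the kind that can stand in for an apoc.path.expand call. The function name, the `name` property, and the `$name` parameter are hypothetical; the real engine's rewrite rules are not shown in this document.

```typescript
// Labels and relationship types cannot be passed as query parameters,
// so they are validated before being interpolated into the query text.
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

function expandAsVariableLengthPath(
  startLabel: string,
  relTypes: string[],
  minHops: number,
  maxHops: number,
): string {
  for (const name of [startLabel, ...relTypes]) {
    if (!IDENTIFIER.test(name)) throw new Error(`Invalid identifier: ${name}`);
  }
  if (
    !Number.isInteger(minHops) ||
    !Number.isInteger(maxHops) ||
    minHops < 0 ||
    maxHops < minHops
  ) {
    throw new RangeError(`Invalid hop range: ${minHops}..${maxHops}`);
  }
  // An empty type list means "any relationship type".
  const types = relTypes.length > 0 ? ":" + relTypes.join("|") : "";
  return (
    `MATCH path = (start:${startLabel} {name: $name})` +
    `-[${types}*${minHops}..${maxHops}]->(target) ` +
    `RETURN path`
  );
}

console.log(expandAsVariableLengthPath("FortranSubroutine", ["CALLS", "USES"], 1, 3));
```

The start node value stays a query parameter, so only schema-level names are ever spliced into the query string.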

Component 3: MCP Server Container (ECS Fargate)

  • Runs UnifiedMCPServer.js in full scenario mode
  • 1 vCPU, 2GB memory, minimum 1 task (avoids cold starts)
  • Auto-scales based on request volume
  • Connects to Neptune + OpenSearch via VPC private networking
  • Pulls secrets from Secrets Manager at startup via task IAM role

Component 4: API Gateway + CloudFront Layer

  • CloudFront distribution with WAF (rate limiting, geo-restriction, SQL injection protection)
  • API Gateway routes /mcp to ECS Fargate via ALB
  • Cognito user pool for OAuth 2.0 token validation
  • Protected Resource Metadata endpoint (RFC 9728) for MCP client discovery
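
For orientation, a Protected Resource Metadata document could be assembled roughly as follows. The field names follow RFC 9728; the resource URL, region, user pool ID, and scope are placeholders, and treating the Cognito user pool issuer URL as the authorization server identifier is an assumption rather than something this spec states.

```typescript
// Subset of the RFC 9728 metadata fields.
interface ProtectedResourceMetadata {
  resource: string;
  authorization_servers: string[];
  bearer_methods_supported: string[];
  scopes_supported?: string[];
}

function buildResourceMetadata(
  resourceUrl: string,
  region: string,
  userPoolId: string,
  scopes: string[] = [],
): ProtectedResourceMetadata {
  const metadata: ProtectedResourceMetadata = {
    resource: resourceUrl,
    // Cognito user pool issuer URL (assumed to identify the authorization server).
    authorization_servers: [`https://cognito-idp.${region}.amazonaws.com/${userPoolId}`],
    // Tokens are expected in the Authorization header.
    bearer_methods_supported: ["header"],
  };
  if (scopes.length > 0) metadata.scopes_supported = scopes;
  return metadata;
}

// Placeholder values only. RFC 9728 publishes this document under the
// /.well-known/oauth-protected-resource path of the resource host.
const example = buildResourceMetadata(
  "https://mcp.example.org/mcp",
  "us-east-1",
  "us-east-1_EXAMPLE",
  ["mcp/tools"],
);
console.log(JSON.stringify(example, null, 2));
```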

Component 5: CDK Infrastructure Stacks

| Stack | Resources | Dependencies |
| --- | --- | --- |
| MdcVpcStack | VPC, subnets, NAT Gateway, VPC endpoints | None |
| MdcDataStack | Neptune, OpenSearch, EFS, S3 | MdcVpcStack |
| MdcSecurityStack | Cognito, WAF, Secrets Manager, IAM roles | MdcVpcStack |
| MdcServerStack | ECS, Fargate, ALB, API Gateway, CloudFront | All above |
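
The dependency column implies a deployment order. CDK works this order out on its own once stacks reference each other, so the sketch below is only an illustration of the ordering, using a plain depth-first topological sort; `deployOrder` is a hypothetical helper, not part of the CDK project.

```typescript
// Dependencies transcribed from the table above ("All above" expanded).
const stackDependencies: Record<string, string[]> = {
  MdcVpcStack: [],
  MdcDataStack: ["MdcVpcStack"],
  MdcSecurityStack: ["MdcVpcStack"],
  MdcServerStack: ["MdcVpcStack", "MdcDataStack", "MdcSecurityStack"],
};

// Returns stacks ordered so that every stack appears after its dependencies.
function deployOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  const visit = (stack: string): void => {
    if (state[stack] === "done") return;
    if (state[stack] === "visiting") throw new Error(`Dependency cycle at ${stack}`);
    const required = deps[stack];
    if (required === undefined) throw new Error(`Unknown stack: ${stack}`);
    state[stack] = "visiting";
    for (const dependency of required) visit(dependency);
    state[stack] = "done";
    order.push(stack);
  };

  for (const stack of Object.keys(deps)) visit(stack);
  return order;
}

console.log(deployOrder(stackDependencies).join(" -> "));
```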

Component 6: Configuration & Secrets Management

| Legacy Config | AWS Service | Key Path |
| --- | --- | --- |
| NEO4J_PASSWORD | Secrets Manager | mdc-mcp-rag/neptune/credentials |
| CHROMADB_URL | SSM Parameter Store | /mdc-mcp-rag/opensearch/endpoint |
| NEO4J_URI | SSM Parameter Store | /mdc-mcp-rag/neptune/endpoint |
| GITHUB_TOKEN | Secrets Manager | mdc-mcp-rag/github/token |
| Auth tokens | Cognito | User pool client credentials |
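
A per-process cache in front of these lookups might be sketched as below. The remote call is injected as `fetchRemote` (in practice it would wrap the SSM or Secrets Manager client), and `createConfigResolver`, its parameters, and the fallback behavior are illustrative assumptions. Cache expiry is deliberately left out.

```typescript
type Fetcher = (key: string) => Promise<string>;

function createConfigResolver(
  fetchRemote: Fetcher,
  env: Record<string, string | undefined>,
) {
  // Caching the promise (not the value) means concurrent callers share one request.
  const cache = new Map<string, Promise<string>>();

  // `envFallback` names a legacy environment variable to use if the remote call fails.
  return function resolveConfig(key: string, envFallback?: string): Promise<string> {
    const cached = cache.get(key);
    if (cached !== undefined) return cached;

    const pending = fetchRemote(key).catch((err: unknown) => {
      const fallback = envFallback !== undefined ? env[envFallback] : undefined;
      if (fallback === undefined) {
        cache.delete(key); // allow a later call to try again
        throw err;
      }
      // Log the variable name only, never the value.
      console.warn(`Config ${key}: remote lookup failed, using ${envFallback}`);
      return fallback;
    });
    cache.set(key, pending);
    return pending;
  };
}
```

A call such as `resolveConfig("/mdc-mcp-rag/neptune/endpoint", "NEO4J_URI")` would then hit the remote store at most once per process for that key.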

Data Models

OpenSearch Index Design (replacing ChromaDB)

| ChromaDB Collection | OpenSearch Index | Documents | Notes |
| --- | --- | --- | --- |
| code-with-context-v8-0-0 | mdc-code-context | ~58,761 | Largest; Python, Fortran, Shell source |
| global-workflow-docs-v8-0-0 | mdc-workflow-docs | ~3,514 | Documentation, READMEs |
| jjobs-v8-0-0 | mdc-jjobs | ~700 | J-Job scripts with structured metadata |
| community-summaries | mdc-community-summaries | ~828 | Hierarchical community embeddings (4 levels) |
| ee2-standards-v5-0-0-enhanced | mdc-ee2-standards | ~34 | EE2/NCO compliance standards |

Index mapping: embedding (knn_vector, 768-dim, nmslib, cosinesimil, hnsw), content (text), metadata (object), source_file (keyword), chunk_id (keyword), collection_name (keyword)
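
Written out as an index-creation body, the mapping above would look roughly like this. The field names and k-NN settings are taken from the line above; the wrapper function is hypothetical, and whether every setting is accepted depends on the OpenSearch deployment type and version.

```typescript
const EMBEDDING_DIM = 768;

// Body for creating one k-NN index; the same mapping is assumed for all five indices.
function buildIndexBody() {
  return {
    settings: { index: { knn: true } },
    mappings: {
      properties: {
        embedding: {
          type: "knn_vector",
          dimension: EMBEDDING_DIM,
          method: { name: "hnsw", space_type: "cosinesimil", engine: "nmslib" },
        },
        content: { type: "text" },
        metadata: { type: "object" },
        source_file: { type: "keyword" },
        chunk_id: { type: "keyword" },
        collection_name: { type: "keyword" },
      },
    },
  };
}

const indexNames = [
  "mdc-code-context",
  "mdc-workflow-docs",
  "mdc-jjobs",
  "mdc-community-summaries",
  "mdc-ee2-standards",
];

for (const index of indexNames) {
  console.log(index, JSON.stringify(buildIndexBody().mappings.properties.embedding));
}
```

With the `@opensearch-project/opensearch` client, each body would typically be passed to `client.indices.create({ index, body })`.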

Neptune Graph Schema (replacing Neo4j)

  • 28 node labels preserved: FortranSubroutine, FortranFunction, FortranModule, PythonFunction, ShellScript, ShellFunction, EnvironmentVariable, Community, etc.
  • 23 relationship types preserved: CALLS, USES, DEFINES, IMPORTS, DEPENDS_ON_ENV, SOURCES, INVOKES, EXECUTES, MEMBER_OF, PARENT_OF, INTERACTS_WITH, etc.
  • Pre-computed communities stored as nodes (materialized in Phase 24E-5)
  • Bolt-compatible endpoint for openCypher queries

Algorithms

Backend Selection

  1. Read DB_BACKEND from SSM parameter /mdc-mcp-rag/db-backend or environment variable
  2. "aws" → instantiate OpenSearch + Neptune adapters
  3. "legacy" → instantiate ChromaDB + Neo4j adapters
  4. Unknown → descriptive error
  5. Connect both adapters, verify health checks pass
  6. Return adapters to UnifiedDataAccess (transparent to all 51 tools)
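
Steps 1 through 4 can be sketched as a small pure function. The adapter names come from the Component 1 list; `selectAdapters` itself is hypothetical, and giving the SSM value precedence over the environment variable is an assumption, since the spec does not state which wins.

```typescript
type Backend = "aws" | "legacy";

interface AdapterPair {
  vector: string;
  graph: string;
}

const ADAPTERS: Record<Backend, AdapterPair> = {
  aws: { vector: "OpenSearchAdapter", graph: "NeptuneAdapter" },
  legacy: { vector: "ChromaDBLegacyAdapter", graph: "Neo4jLegacyAdapter" },
};

function isBackend(value: string): value is Backend {
  return value === "aws" || value === "legacy";
}

// Assumption: the SSM parameter, when present, overrides the environment variable.
function selectAdapters(
  ssmValue: string | undefined,
  envValue: string | undefined,
): AdapterPair {
  const raw = ssmValue ?? envValue;
  if (raw === undefined) {
    throw new Error("DB_BACKEND is not set in SSM (/mdc-mcp-rag/db-backend) or the environment");
  }
  const value = raw.trim().toLowerCase();
  if (!isBackend(value)) {
    throw new Error(`Unknown DB_BACKEND "${raw}"; expected "aws" or "legacy"`);
  }
  return ADAPTERS[value];
}

console.log(selectAdapters(undefined, "legacy"));
```

Connecting the adapters and running their health checks (steps 5 and 6) would follow once the pair is chosen.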

OpenSearch Vector Query

  1. Generate 768-dim MPNet embedding from query text
  2. Build k-NN search body with optional metadata filters (bool query + filter clause)
  3. Execute against OpenSearch index
  4. Normalize cosine similarity scores to [0, 1]
  5. Return results in _formatQueryResults() compatible format
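
Steps 2 and 4 might look like the sketch below. The function names are hypothetical. The filter assumes metadata fields are indexed as exact-match (keyword-style) values, and the normalization assumes the adapter has a raw cosine similarity in [-1, 1]; how OpenSearch's `_score` maps to cosine similarity depends on the engine and version, so that conversion is not shown.

```typescript
type MetadataFilter = Record<string, string | number | boolean>;

// Builds a bool query: the k-NN clause scores documents, the filter clause
// restricts them by metadata without affecting the score.
function buildKnnSearchBody(
  embedding: number[],
  k: number,
  filters: MetadataFilter = {},
) {
  if (embedding.length !== 768) {
    throw new Error(`Expected a 768-dim embedding, got ${embedding.length}`);
  }
  const filterClauses = Object.keys(filters).map((field) => ({
    term: { [`metadata.${field}`]: filters[field] },
  }));
  return {
    size: k,
    query: {
      bool: {
        must: [{ knn: { embedding: { vector: embedding, k } } }],
        filter: filterClauses,
      },
    },
  };
}

// Maps a cosine similarity in [-1, 1] linearly onto [0, 1], clamping
// small floating-point overshoots.
function normalizeCosine(cosine: number): number {
  return Math.min(1, Math.max(0, (cosine + 1) / 2));
}
```

Because the filter sits beside the k-NN clause rather than inside it, fewer than `k` hits can come back when the filter is selective.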

Neptune openCypher Query Adapter

  1. Scan query for APOC procedure calls
  2. Transform each known APOC call to openCypher equivalent
  3. Throw UnsupportedQueryError for unknown APOC procedures
  4. Execute transformed query via Neptune bolt endpoint
  5. Convert Neptune records to Neo4j _recordToObject() compatible format
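
Steps 1 and 3 can be sketched with a simple scan. The list of known procedures is taken from the Component 2 table; the regular expression is a simplification that would also match APOC names inside string literals, and the function names are hypothetical.

```typescript
class UnsupportedQueryError extends Error {
  readonly procedure: string;

  constructor(procedure: string) {
    super(`Unsupported APOC procedure: ${procedure}`);
    this.name = "UnsupportedQueryError";
    this.procedure = procedure;
    // Keeps instanceof working when compiled to older JavaScript targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

const KNOWN_APOC = [
  "apoc.path.expand",
  "apoc.algo.dijkstra",
  "apoc.periodic.iterate",
  "apoc.create.node",
  "apoc.merge.node",
];

// Returns the distinct APOC procedure names that appear in the query text.
function findApocCalls(query: string): string[] {
  const matches = query.match(/\bapoc(?:\.[A-Za-z_][A-Za-z0-9_]*)+/g) ?? [];
  return Array.from(new Set(matches));
}

// Throws for the first procedure without a known replacement; otherwise
// returns the calls that the transformation step would need to rewrite.
function assertTransformable(query: string): string[] {
  const calls = findApocCalls(query);
  for (const name of calls) {
    if (KNOWN_APOC.indexOf(name) === -1) throw new UnsupportedQueryError(name);
  }
  return calls;
}
```

Failing before execution means an unsupported query surfaces as a named error instead of an opaque Neptune syntax failure.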

Data Migration

  1. Export Neo4j graph dump → S3 staging bucket
  2. Export ChromaDB collections (embeddings + metadata + content) → S3
  3. Neptune bulk loader imports graph from S3
  4. OpenSearch bulk index imports vectors per collection
  5. Verify: node count, relationship count, document count per index all match legacy
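
The verification in step 5 amounts to comparing two sets of counts. A minimal sketch follows; `verifyCountParity` is a hypothetical name, and the numbers are illustrative only, not measurements.

```typescript
type Counts = Record<string, number>;

interface Mismatch {
  key: string;
  legacy: number | undefined;
  aws: number | undefined;
}

// Compares every key present on either side, so a missing index or an
// unexpected extra one is reported as well as a differing count.
function verifyCountParity(legacy: Counts, aws: Counts): Mismatch[] {
  const keys = Array.from(
    new Set([...Object.keys(legacy), ...Object.keys(aws)]),
  ).sort();
  const mismatches: Mismatch[] = [];
  for (const key of keys) {
    if (legacy[key] !== aws[key]) {
      mismatches.push({ key, legacy: legacy[key], aws: aws[key] });
    }
  }
  return mismatches;
}

// Illustrative counts only.
const legacyCounts: Counts = { nodes: 1000, relationships: 5000, "mdc-jjobs": 700 };
const awsCounts: Counts = { nodes: 1000, relationships: 4990, "mdc-jjobs": 700 };
const report = verifyCountParity(legacyCounts, awsCounts);
console.log(report.length === 0 ? "parity OK" : report);
```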

Correctness Properties

13 formal properties validated through property-based testing (fast-check):

| # | Property | Validates |
| --- | --- | --- |
| P1 | Tool Interface Preservation: output JSON schema identical between backends | R3.2 |
| P2 | Adapter Output Compatibility: query output format matches legacy | R1.6, R1.7 |
| P3 | APOC Transformation Semantic Preservation | R2.7 |
| P4 | Data Completeness: node/rel/doc counts match after migration | R4.5–4.7 |
| P5 | Migration Idempotence: re-run produces identical state | R4.8 |
| P6 | Embedding Fidelity: 768-dim vectors bitwise identical after migration | R5.1 |
| P7 | Score Normalization: all cosine similarity scores in [0, 1] | R5.3 |
| P8 | Search Equivalence: ranking within 5% tolerance | R6.1–6.3 |
| P9 | Health Check Accuracy: correct healthy/degraded reporting | R11.1–11.2 |
| P10 | Graceful Degradation: unaffected tools continue working | R11.3, R14.1–14.2 |
| P11 | Secret Non-Exposure: no secrets in logs, outputs, or env dumps | R8.5–8.6 |
| P12 | Configuration Caching: single API call per key per process | R8.3 |
| P13 | Retry Exponential Backoff: 5s, 10s, 20s, max 60s | R14.4 |

Error Handling

| Scenario | Response | Recovery |
| --- | --- | --- |
| Neptune unreachable | Graph tools degraded; filesystem + vector tools continue | CloudWatch alarm; exponential backoff retry |
| OpenSearch index missing | Empty results with warning; health reports degraded | Re-run migration for specific index |
| Unknown APOC procedure | UnsupportedQueryError with procedure name | Add to replacement map or implement Gremlin fallback |
| Secrets Manager throttled | Use cached secrets; fall back to env vars with warning | Secrets cached 5 min; VPC endpoint avoids internet |
| Migration partial failure | Idempotent re-execution from last watermark | Re-run script; verification asserts count parity |
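
The degradation behavior in the first row can be sketched as a filter over tool requirements: a tool stays available as long as every backend it needs is healthy. The tool names and the `requires` field below are hypothetical; the spec does not list which of the 51 tools depend on which backend.

```typescript
type Dependency = "graph" | "vector";

interface ToolSpec {
  name: string;
  requires: Dependency[];
}

interface Availability {
  status: "healthy" | "degraded";
  available: string[];
}

function availableTools(
  tools: ToolSpec[],
  health: Record<Dependency, boolean>,
): Availability {
  const available = tools
    .filter((tool) => tool.requires.every((dep) => health[dep]))
    .map((tool) => tool.name);
  const status = health.graph && health.vector ? "healthy" : "degraded";
  return { status, available };
}

// Hypothetical tool names for illustration.
const tools: ToolSpec[] = [
  { name: "trace_call_chain", requires: ["graph"] },
  { name: "semantic_search", requires: ["vector"] },
  { name: "read_file", requires: [] },
];

// Neptune down, OpenSearch up: only the graph tool drops out.
console.log(availableTools(tools, { graph: false, vector: true }));
```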

Implementation Plan

5 Sub-Phases (46A–46E), 17 Task Groups

Phase 46A: Foundation (Week 1–2)

  • CDK project scaffolding (MdcVpcStack, MdcSecurityStack, MdcDataStack)
  • VPC with public/private subnets, NAT Gateway, VPC endpoints
  • Neptune cluster, OpenSearch domain, EFS, S3 staging bucket
  • Secrets Manager + SSM Parameter Store entries
  • resolveConfig() with caching and fallback

Phase 46B: MCP Server on ECS (Week 2–3)

  • Database adapter interfaces and implementations (OpenSearch, Neptune, legacy wrappers)
  • APOC transformation engine
  • Backend selection and UnifiedDataAccess wiring
  • Dockerfile, MdcServerStack (ECS, Fargate, ALB, API Gateway, CloudFront)
  • Health check and error handling with graceful degradation

Phase 46C: Database Migration (Week 3–5)

  • OpenSearch index creation (5 indices with k-NN mappings)
  • Data migration script (Neo4j → Neptune, ChromaDB → OpenSearch)
  • Migration verification (count parity)
  • Search relevance validation (5% tolerance)

Phase 46D: Ingestion Pipeline Adaptation (Week 5–6)

  • Adapt 7 ingestion scripts for Neptune (bolt/openCypher) and OpenSearch (bulk API)
  • Preserve MPNet embedding model (768-dim)

Phase 46E: Validation & Cutover (Week 6–7)

  • CloudWatch dashboards and alarms
  • Full 51-tool integration tests against AWS backends
  • MCP client configuration cutover
  • Legacy system kept as read-only fallback for 2 weeks

Cost Estimate

| Service | Configuration | Monthly Cost |
| --- | --- | --- |
| Neptune Serverless | 1–8 NCU, openCypher | ~$50–200 |
| OpenSearch Serverless | 2 OCU (search + index) | ~$350 |
| ECS Fargate | 1 vCPU, 2GB, min 1 task | ~$36 |
| CloudFront | Moderate traffic | ~$50 |
| ALB | 1 ALB | ~$17 |
| Cognito | <50K MAU (free tier) | $0 |
| Secrets Manager | 5 secrets | ~$2 |
| EFS | 10GB | ~$3 |
| NAT Gateway | 1 AZ | ~$37 |
| Total | | ~$545–695/month |

Performance Targets

| Metric | Legacy (Docker) | AWS Target |
| --- | --- | --- |
| Vector query latency | ~50ms (ChromaDB local) | ~100–200ms (OpenSearch) |
| Graph query latency | ~20ms (Neo4j local) | ~50–100ms (Neptune) |
| MCP request E2E | ~200ms (stdio local) | ~500ms (HTTP + auth) |
| Startup time | ~5s (npm start) | ~30s (Fargate cold start) |
| Data migration | N/A | ~2–4 hours (one-time) |

Security Summary

  • External access: OAuth 2.0 via Amazon Cognito, with Protected Resource Metadata (RFC 9728) for client discovery
  • Internal (VPC): IAM roles for ECS tasks → Neptune, OpenSearch, Secrets Manager
  • Neptune: IAM authentication (no username/password)
  • Network: All databases in private subnets, VPC endpoints for AWS services
  • Encryption: KMS at rest (Neptune, OpenSearch, EFS, S3), TLS 1.2+ in transit
  • WAF: Rate limiting, geo-restriction, SQL injection protection on CloudFront
  • No secrets in CDK outputs, environment variables, or container logs

Dependencies

| Dependency | Version | Purpose |
| --- | --- | --- |
| AWS CDK | v2.x | Infrastructure as code |
| @opensearch-project/opensearch | ^2.x | OpenSearch Node.js client |
| neo4j-driver | ^5.x | Neptune bolt protocol (openCypher) |
| @aws-sdk/client-secrets-manager | latest | Secrets retrieval |
| @aws-sdk/client-ssm | latest | Parameter Store access |
| @modelcontextprotocol/sdk | existing | MCP protocol (unchanged) |
| sentence-transformers | existing | MPNet embedding model (unchanged) |
| fast-check | ^3.x | Property-based testing |

Generated from Kiro spec artifacts in .kiro/specs/aws-infrastructure-port/ (requirements.md, design.md, tasks.md) on March 20, 2026.