AWS Infrastructure Port Phase 46 Design - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki
AWS Infrastructure Port — Phase 46 Design Specification
MDC MCP RAG Server: Docker → AWS-Native Migration
Date: March 20, 2026 | Branch: develop_aws | Spec: .kiro/specs/aws-infrastructure-port/
Executive Summary
This document consolidates the full design specification for porting the MDC MCP RAG Server from its legacy Docker-based infrastructure on NOAA Parallel Works VMs to AWS-native services. The system provides 51 MCP tools across 9 modules for NOAA Global Workflow AI assistance, backed by ChromaDB (~81K documents, 5 collections) and Neo4j (~95K nodes, ~2.6M relationships).
The migration replaces Docker Compose orchestration with:
| Legacy Component | AWS Replacement |
|---|---|
| Neo4j (Docker) | Amazon Neptune (openCypher) |
| ChromaDB (Docker) | Amazon OpenSearch (k-NN vector search) |
| Docker MCP Gateway | ECS Fargate + API Gateway |
| Docker Compose | AWS CDK (TypeScript IaC) |
| Manual TLS/Auth | CloudFront + WAF + Cognito OAuth 2.0 |
All new infrastructure uses mdc-mcp-rag naming per the EIB → MDC institutional rename. The persistent data root shifts from /mcp_rag_eib to /mdc-mcp-rag.
Table of Contents
- Architecture
- Requirements Summary
- Design Components
- Data Models
- Algorithms
- Correctness Properties
- Error Handling
- Implementation Plan
- Cost Estimate
- Phased Rollout
Architecture
Target AWS Service Topology
┌──────────────────────────────────────────────────────────────┐
│  VPC: mdc-mcp-rag-vpc                                        │
│                                                              │
│  ┌─── Public Subnets ───┐  ┌─── Private Subnets ──────────┐  │
│  │ ALB                  │  │ ECS Fargate (MCP Server)     │  │
│  │ NAT Gateway          │  │ 51 tools, 9 modules          │  │
│  └──────────────────────┘  │                              │  │
│                            │ Amazon Neptune               │  │
│  ┌─── VPC Endpoints ────┐  │ openCypher, ~95K nodes       │  │
│  │ Secrets Manager      │  │                              │  │
│  │ SSM Parameter Store  │  │ Amazon OpenSearch            │  │
│  │ CloudWatch           │  │ k-NN, ~81K docs, 768-dim     │  │
│  └──────────────────────┘  └──────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘
              ▲                             ▲
              │                             │
      CloudFront + WAF            Amazon EFS (/mdc-mcp-rag)
              ▲
              │
  Amazon Cognito (OAuth 2.0)
Legacy → AWS Component Mapping
| Legacy Component | Provisioning Script | AWS Replacement | Phase |
|---|---|---|---|
| Directory structure (`/mcp_rag_eib`) | `01-directories.sh` | EFS mount at `/mdc-mcp-rag` | 46A |
| System dependencies | `02-system-deps.sh` | Amazon Linux 2 AMI + Dockerfile | 46A |
| Docker engine | `03-docker.sh` | Not needed (ECS Fargate) | Eliminated |
| Node.js runtime | `04-nodejs.sh` | Dockerfile base image | 46B |
| Python & Spack | `05-python-spack.sh` | Python in container / pip | 46D |
| ChromaDB (Docker) | `06-chromadb.sh` | Amazon OpenSearch | 46C |
| MCP Server (Node.js) | `07-mcp-server.sh` | ECS Fargate service | 46B |
| Neo4j + n8n (Docker Compose) | `08-services.sh` | Amazon Neptune | 46C |
| Desktop VNC | `09-desktop-vnc.sh` | Not needed (Kiro IDE) | Eliminated |
| Health checks | `10-verification.sh` | CloudWatch + custom health endpoint | 46E |
| Docker MCP Gateway | `11-docker-mcp-gateway.sh` | API Gateway (Streamable HTTP) | 46B |
| Static mode gateway | `12-static-mode-gateway.sh` | CloudFront + ALB routing | 46B |
| Container cleanup | `13-container-cleanup.sh` | ECS task lifecycle (automatic) | Eliminated |
| File permissions | `14-final-ownership.sh` | IAM task roles + EFS POSIX | 46A |
| GitHub Copilot CLI | `15-github-copilot-cli.sh` | Not needed (Kiro IDE) | Eliminated |
Requirements Summary
17 Requirements Across 6 Domains
Database Abstraction (R1–R2)
- Adapter interfaces (`VectorDatabaseAdapter`, `GraphDatabaseAdapter`) abstract all 51 tools from backend specifics
- APOC procedure calls transparently replaced with openCypher equivalents for Neptune compatibility
- Backend selection via `DB_BACKEND` env var or SSM parameter
Tool Interface Preservation (R3)
- All 51 tools across 9 modules expose identical input schemas and output formats on AWS
- Streamable HTTP transport via API Gateway replaces Docker MCP Gateway on port 18888
Data Migration & Fidelity (R4–R6)
- Complete graph + vector data migration with count parity verification
- 768-dim MPNet embeddings preserved exactly (no re-embedding)
- Search relevance within 5% tolerance (epsilon = 0.05) between OpenSearch and ChromaDB
Infrastructure as Code (R7–R10)
- Four CDK stacks: `MdcVpcStack`, `MdcDataStack`, `MdcSecurityStack`, `MdcServerStack`
- Secrets in AWS Secrets Manager, config in SSM Parameter Store
- ECS Fargate with auto-scaling, ALB health checks
- CloudFront + WAF + Cognito OAuth 2.0
Security & Resilience (R11–R14)
- All databases in private subnets, VPC endpoints for AWS services
- KMS encryption at rest, TLS 1.2+ in transit
- Graceful degradation: graph-dependent tools degrade independently of vector-search tools
- Exponential backoff retries (5s, 10s, 20s, max 60s)
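The retry schedule in the last bullet can be sketched as a pure function. This is a minimal illustration, assuming the delay doubles from a 5 s base and is clamped at 60 s; the function name and the `attempts` parameter are hypothetical, since the spec gives only the first delays and the cap.

```typescript
// Delay schedule for exponential backoff: start at 5 s, double each attempt,
// never exceed 60 s. `attempts` is a hypothetical parameter; the spec lists
// only the first delays and the cap.
function backoffDelaysMs(
  attempts: number,
  baseMs: number = 5000,
  capMs: number = 60000
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    // 5000 * 2^attempt, clamped to the cap
    delays.push(Math.min(baseMs * Math.pow(2, attempt), capMs));
  }
  return delays;
}
```

For five attempts this yields 5 s, 10 s, 20 s, 40 s, 60 s. The 40 s step follows from doubling and is not listed explicitly in the spec.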
Operations (R15–R17)
- 7 ingestion scripts adapted for Neptune + OpenSearch
- Phased rollout with legacy coexistence during migration
- 5 OpenSearch indices mapped from 5 ChromaDB collections
Design Components
Component 1: Database Adapter Layer
The critical migration seam. Abstracts database access so tool modules work identically against legacy or AWS backends.
VectorDatabaseAdapter interface:
- `connect()`, `query()`, `multiCollectionQuery()`, `addDocuments()`
- `listCollections()`, `getCollectionCount()`, `healthCheck()`, `close()`

GraphDatabaseAdapter interface:
- `connect()`, `query()`, `findCallers()`, `traceCallChain()`
- `getStatistics()`, `healthCheck()`, `close()`

Implementations:
- `OpenSearchAdapter`: AWS backend (k-NN search, AWS SigV4 auth)
- `ChromaDBLegacyAdapter`: wraps existing `VectorDatabase.js`
- `NeptuneAdapter`: AWS backend (openCypher, IAM auth, APOC transformation)
- `Neo4jLegacyAdapter`: wraps existing `GraphDatabase.js`
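As a rough sketch, the two adapter contracts might be typed as below. The method names come from the spec. Every parameter and return type, the `VectorHit` shape, and the `missingMethods` helper are assumptions for illustration. Because the legacy wrappers are plain JavaScript, a runtime duck-type check like this could catch a missing method at startup.

```typescript
// Method names come from the spec; every parameter and return type below is
// an assumption made for illustration.
interface VectorHit {
  id: string;
  score: number;
  content: string;
  metadata: Record<string, unknown>;
}

interface VectorDatabaseAdapter {
  connect(): Promise<void>;
  query(collection: string, text: string, k: number): Promise<VectorHit[]>;
  multiCollectionQuery(collections: string[], text: string, k: number): Promise<VectorHit[]>;
  addDocuments(collection: string, docs: VectorHit[]): Promise<void>;
  listCollections(): Promise<string[]>;
  getCollectionCount(collection: string): Promise<number>;
  healthCheck(): Promise<boolean>;
  close(): Promise<void>;
}

interface GraphDatabaseAdapter {
  connect(): Promise<void>;
  query(cypher: string, params?: Record<string, unknown>): Promise<Array<Record<string, unknown>>>;
  findCallers(name: string): Promise<string[]>;
  traceCallChain(name: string, maxDepth: number): Promise<string[][]>;
  getStatistics(): Promise<Record<string, number>>;
  healthCheck(): Promise<boolean>;
  close(): Promise<void>;
}

const VECTOR_METHODS = ["connect", "query", "multiCollectionQuery", "addDocuments",
  "listCollections", "getCollectionCount", "healthCheck", "close"];
const GRAPH_METHODS = ["connect", "query", "findCallers", "traceCallChain",
  "getStatistics", "healthCheck", "close"];

// Names of required methods that `candidate` does not provide (hypothetical helper).
function missingMethods(candidate: unknown, required: string[]): string[] {
  if (typeof candidate !== "object" || candidate === null) return required.slice();
  const obj = candidate as Record<string, unknown>;
  return required.filter(function (name) { return typeof obj[name] !== "function"; });
}

function isVectorAdapter(x: unknown): x is VectorDatabaseAdapter {
  return missingMethods(x, VECTOR_METHODS).length === 0;
}

function isGraphAdapter(x: unknown): x is GraphDatabaseAdapter {
  return missingMethods(x, GRAPH_METHODS).length === 0;
}
```

The check only confirms that the methods exist. It says nothing about whether their outputs match the legacy format, which is what properties P1 and P2 are meant to cover.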
Component 2: APOC Transformation Engine
Neptune does not support APOC procedures. The transformation engine transparently rewrites queries:
| APOC Procedure | openCypher Replacement |
|---|---|
| `apoc.path.expand` | Variable-length path patterns |
| `apoc.algo.dijkstra` | Neptune shortest path / Gremlin |
| `apoc.periodic.iterate` | Batched UNWIND queries |
| `apoc.create.node` | Standard CREATE |
| `apoc.merge.node` | MERGE with ON CREATE SET / ON MATCH SET |
Unknown APOC procedures throw UnsupportedQueryError.
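The scan-and-classify step of such an engine could look roughly like the sketch below. It only finds APOC procedure names and looks them up in the replacement table; the actual query rewriting is not shown. The function name `planApocRewrites`, the regular expression, and the `procedure` field on the error are assumptions, not the project's implementation.

```typescript
// Replacement descriptions are taken from the table above.
const APOC_REPLACEMENTS: Record<string, string> = {
  "apoc.path.expand": "variable-length path patterns",
  "apoc.algo.dijkstra": "Neptune shortest path / Gremlin",
  "apoc.periodic.iterate": "batched UNWIND queries",
  "apoc.create.node": "standard CREATE",
  "apoc.merge.node": "MERGE with ON CREATE SET / ON MATCH SET",
};

class UnsupportedQueryError extends Error {
  procedure: string;
  constructor(procedure: string) {
    super("Unsupported APOC procedure: " + procedure);
    this.name = "UnsupportedQueryError";
    this.procedure = procedure;
  }
}

interface ApocRewrite {
  procedure: string;
  replacement: string;
}

// Lists every APOC call found in `query`, in order of appearance, and throws
// on the first procedure that has no known replacement.
function planApocRewrites(query: string): ApocRewrite[] {
  // Matches dotted names such as "apoc.path.expand" (assumed pattern).
  const pattern = /\bapoc(?:\.[A-Za-z_]\w*)+/g;
  const plan: ApocRewrite[] = [];
  let match = pattern.exec(query);
  while (match !== null) {
    const procedure = match[0];
    if (!Object.prototype.hasOwnProperty.call(APOC_REPLACEMENTS, procedure)) {
      throw new UnsupportedQueryError(procedure);
    }
    plan.push({ procedure: procedure, replacement: APOC_REPLACEMENTS[procedure] });
    match = pattern.exec(query);
  }
  return plan;
}
```

A plain regular expression also matches `apoc.` text inside string literals or comments, so a real engine would likely need a Cypher-aware tokenizer.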
Component 3: MCP Server Container (ECS Fargate)
- Runs `UnifiedMCPServer.js` in `full` scenario mode
- 1 vCPU, 2 GB memory, minimum 1 task (avoids cold starts)
- Auto-scales based on request volume
- Connects to Neptune + OpenSearch via VPC private networking
- Pulls secrets from Secrets Manager at startup via task IAM role
Component 4: API Gateway + CloudFront Layer
- CloudFront distribution with WAF (rate limiting, geo-restriction, SQL injection protection)
- API Gateway routes `/mcp` to ECS Fargate via ALB
- Cognito user pool for OAuth 2.0 token validation
- Protected Resource Metadata endpoint (RFC 9728) for MCP client discovery
Component 5: CDK Infrastructure Stacks
| Stack | Resources | Dependencies |
|---|---|---|
| `MdcVpcStack` | VPC, subnets, NAT Gateway, VPC endpoints | None |
| `MdcDataStack` | Neptune, OpenSearch, EFS, S3 | `MdcVpcStack` |
| `MdcSecurityStack` | Cognito, WAF, Secrets Manager, IAM roles | `MdcVpcStack` |
| `MdcServerStack` | ECS, Fargate, ALB, API Gateway, CloudFront | All above |
Component 6: Configuration & Secrets Management
| Legacy Config | AWS Service | Key Path |
|---|---|---|
| `NEO4J_PASSWORD` | Secrets Manager | `mdc-mcp-rag/neptune/credentials` |
| `CHROMADB_URL` | SSM Parameter Store | `/mdc-mcp-rag/opensearch/endpoint` |
| `NEO4J_URI` | SSM Parameter Store | `/mdc-mcp-rag/neptune/endpoint` |
| `GITHUB_TOKEN` | Secrets Manager | `mdc-mcp-rag/github/token` |
| Auth tokens | Cognito | User pool client credentials |
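A resolver with caching and environment-variable fallback could be sketched as follows. It assumes a 5-minute TTL (the figure given for cached secrets in the error-handling table) and an injected `fetchRemote` function standing in for the Secrets Manager or SSM call. `createConfigResolver`, its parameters, and the warning text are hypothetical.

```typescript
interface CacheEntry {
  value: string;
  expiresAt: number;
}

// `fetchRemote` stands in for a Secrets Manager or SSM lookup and throws on
// failure. It is synchronous only to keep the sketch short; real SDK calls
// are asynchronous.
function createConfigResolver(
  fetchRemote: (key: string) => string,
  env: Record<string, string | undefined>,
  ttlMs: number = 5 * 60 * 1000,
  now: () => number = function () { return Date.now(); }
) {
  const cache: Record<string, CacheEntry | undefined> = {};
  const warnings: string[] = [];

  function resolveConfig(key: string, envName: string): string {
    const cached = cache[key];
    if (cached !== undefined && cached.expiresAt > now()) return cached.value;
    try {
      const value = fetchRemote(key);
      cache[key] = { value: value, expiresAt: now() + ttlMs };
      return value;
    } catch (err) {
      // Remote lookup failed (for example, throttling): prefer a stale cached
      // value, then the legacy environment variable, with a warning.
      if (cached !== undefined) return cached.value;
      const fallback = env[envName];
      if (fallback === undefined) throw new Error("No value available for " + key);
      warnings.push("Falling back to env var " + envName + " for " + key);
      return fallback;
    }
  }

  return { resolveConfig: resolveConfig, warnings: warnings };
}
```

Injecting the fetcher and the clock keeps the caching behavior testable without AWS access. Warnings are collected rather than logged so that secret values never pass through a logger.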
Data Models
OpenSearch Index Design (replacing ChromaDB)
| ChromaDB Collection | OpenSearch Index | Documents | Notes |
|---|---|---|---|
| `code-with-context-v8-0-0` | `mdc-code-context` | ~58,761 | Largest; Python, Fortran, Shell source |
| `global-workflow-docs-v8-0-0` | `mdc-workflow-docs` | ~3,514 | Documentation, READMEs |
| `jjobs-v8-0-0` | `mdc-jjobs` | ~700 | J-Job scripts with structured metadata |
| `community-summaries` | `mdc-community-summaries` | ~828 | Hierarchical community embeddings (4 levels) |
| `ee2-standards-v5-0-0-enhanced` | `mdc-ee2-standards` | ~34 | EE2/NCO compliance standards |
Index mapping: embedding (knn_vector, 768-dim, nmslib, cosinesimil, hnsw), content (text), metadata (object), source_file (keyword), chunk_id (keyword), collection_name (keyword)
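Written out as an index-creation request body, that mapping might look like the sketch below. The field names and k-NN method parameters come from the line above; the `index.knn` setting is the standard switch for k-NN indices. The function name `buildIndexBody` is hypothetical, and HNSW tuning parameters are omitted because the spec does not give them.

```typescript
// Body for creating one of the five indices (for example, `mdc-jjobs`).
// The same mapping is assumed to apply to every index.
function buildIndexBody(dimension: number = 768) {
  return {
    settings: { index: { knn: true } },
    mappings: {
      properties: {
        embedding: {
          type: "knn_vector",
          dimension: dimension,
          method: { name: "hnsw", space_type: "cosinesimil", engine: "nmslib" },
        },
        content: { type: "text" },
        metadata: { type: "object" },
        source_file: { type: "keyword" },
        chunk_id: { type: "keyword" },
        collection_name: { type: "keyword" },
      },
    },
  };
}
```

Engine support differs between OpenSearch versions and between managed domains and serverless collections, so the `engine` value is worth checking against the deployed target before the indices are created.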
Neptune Graph Schema (replacing Neo4j)
- 28 node labels preserved: FortranSubroutine, FortranFunction, FortranModule, PythonFunction, ShellScript, ShellFunction, EnvironmentVariable, Community, etc.
- 23 relationship types preserved: CALLS, USES, DEFINES, IMPORTS, DEPENDS_ON_ENV, SOURCES, INVOKES, EXECUTES, MEMBER_OF, PARENT_OF, INTERACTS_WITH, etc.
- Pre-computed communities stored as nodes (materialized in Phase 24E-5)
- Bolt-compatible endpoint for openCypher queries
Algorithms
Backend Selection
- Read `DB_BACKEND` from SSM parameter `/mdc-mcp-rag/db-backend` or environment variable
  - `"aws"` → instantiate OpenSearch + Neptune adapters
  - `"legacy"` → instantiate ChromaDB + Neo4j adapters
  - Unknown → descriptive error
- Connect both adapters, verify health checks pass
- Return adapters to `UnifiedDataAccess` (transparent to all 51 tools)
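The selection step can be sketched as below. The spec does not say whether the SSM parameter or the environment variable wins when both are set; this sketch assumes SSM takes precedence. The function names are hypothetical, and the connect and health-check steps are left out.

```typescript
type Backend = "aws" | "legacy";

interface AdapterPair {
  vector: string;
  graph: string;
}

// Assumption: the SSM value, when present, overrides the environment variable.
function resolveBackend(ssmValue: string | undefined, envValue: string | undefined): Backend {
  const raw = ssmValue !== undefined ? ssmValue : envValue;
  if (raw === "aws" || raw === "legacy") return raw;
  throw new Error(
    "Unknown DB_BACKEND " + JSON.stringify(raw) + '; expected "aws" or "legacy"'
  );
}

// Returns the adapter class names from the spec rather than live instances,
// so the sketch stays free of database clients.
function adapterNamesFor(backend: Backend): AdapterPair {
  return backend === "aws"
    ? { vector: "OpenSearchAdapter", graph: "NeptuneAdapter" }
    : { vector: "ChromaDBLegacyAdapter", graph: "Neo4jLegacyAdapter" };
}
```

Restricting `Backend` to a string-literal union lets the compiler flag any code path that forgets one of the two backends.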
OpenSearch Vector Query
- Generate 768-dim MPNet embedding from query text
- Build k-NN search body with optional metadata filters (bool query + filter clause)
- Execute against OpenSearch index
- Normalize cosine similarity scores to [0, 1]
- Return results in `_formatQueryResults()`-compatible format
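Steps two and four might be sketched as follows: a k-NN search body wrapped in a `bool` query when metadata filters are present, and a normalization helper. How OpenSearch converts cosine distance into `_score` depends on engine and version, so the helper below takes a raw cosine similarity in [-1, 1] as input; that choice, the linear mapping, and both function names are assumptions.

```typescript
// Builds the search request body. `filters` maps keyword fields (for example
// `collection_name`) to exact values.
function buildKnnSearchBody(
  vector: number[],
  k: number,
  filters: Record<string, string> = {}
): object {
  const knnClause = { knn: { embedding: { vector: vector, k: k } } };
  const filterClauses = Object.keys(filters).map(function (field) {
    const term: Record<string, string> = {};
    term[field] = filters[field];
    return { term: term };
  });
  const query = filterClauses.length === 0
    ? knnClause
    : { bool: { must: [knnClause], filter: filterClauses } };
  return { size: k, query: query };
}

// Maps a raw cosine similarity in [-1, 1] linearly onto [0, 1] and clamps
// small floating-point overshoot.
function normalizeCosine(cosine: number): number {
  return Math.min(1, Math.max(0, (cosine + 1) / 2));
}
```

With a `bool` filter the filter is applied to the nearest neighbors after they are found, so a restrictive filter can return fewer than `k` hits.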
Neptune openCypher Query Adapter
- Scan query for APOC procedure calls
- Transform each known APOC call to openCypher equivalent
- Throw `UnsupportedQueryError` for unknown APOC procedures
- Execute transformed query via Neptune bolt endpoint
- Convert Neptune records to Neo4j `_recordToObject()`-compatible format
Data Migration
- Export Neo4j graph dump → S3 staging bucket
- Export ChromaDB collections (embeddings + metadata + content) → S3
- Neptune bulk loader imports graph from S3
- OpenSearch bulk index imports vectors per collection
- Verify: node count, relationship count, document count per index all match legacy
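The verification step could be sketched as a pure comparison over counts collected from both sides. The collection-to-index mapping comes from the OpenSearch index table; the `StoreCounts` shape and the function name are hypothetical, and gathering the counts from each database is not shown.

```typescript
interface StoreCounts {
  nodes: number;
  relationships: number;
  // Keyed by ChromaDB collection name (legacy) or OpenSearch index name (AWS).
  documents: Record<string, number>;
}

const INDEX_FOR_COLLECTION: Record<string, string> = {
  "code-with-context-v8-0-0": "mdc-code-context",
  "global-workflow-docs-v8-0-0": "mdc-workflow-docs",
  "jjobs-v8-0-0": "mdc-jjobs",
  "community-summaries": "mdc-community-summaries",
  "ee2-standards-v5-0-0-enhanced": "mdc-ee2-standards",
};

// Returns one message per mismatch; an empty array means count parity holds.
function findParityMismatches(legacy: StoreCounts, aws: StoreCounts): string[] {
  const problems: string[] = [];
  if (legacy.nodes !== aws.nodes) {
    problems.push("nodes: " + legacy.nodes + " vs " + aws.nodes);
  }
  if (legacy.relationships !== aws.relationships) {
    problems.push("relationships: " + legacy.relationships + " vs " + aws.relationships);
  }
  Object.keys(INDEX_FOR_COLLECTION).forEach(function (collection) {
    const index = INDEX_FOR_COLLECTION[collection];
    const before = legacy.documents[collection];
    const after = aws.documents[index];
    // A missing count on either side is treated as a mismatch.
    if (before === undefined || before !== after) {
      problems.push(collection + " -> " + index + ": " + before + " vs " + after);
    }
  });
  return problems;
}
```

Count parity is a necessary check but not a sufficient one: it does not compare embeddings or graph properties, which properties P6 and P8 address separately.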
Correctness Properties
13 formal properties validated through property-based testing (fast-check):
| # | Property | Validates |
|---|---|---|
| P1 | Tool Interface Preservation — output JSON schema identical between backends | R3.2 |
| P2 | Adapter Output Compatibility — query output format matches legacy | R1.6, R1.7 |
| P3 | APOC Transformation Semantic Preservation | R2.7 |
| P4 | Data Completeness — node/rel/doc counts match after migration | R4.5–4.7 |
| P5 | Migration Idempotence — re-run produces identical state | R4.8 |
| P6 | Embedding Fidelity — 768-dim vectors bitwise identical after migration | R5.1 |
| P7 | Score Normalization — all cosine similarity scores in [0, 1] | R5.3 |
| P8 | Search Equivalence — ranking within 5% tolerance | R6.1–6.3 |
| P9 | Health Check Accuracy — correct healthy/degraded reporting | R11.1–11.2 |
| P10 | Graceful Degradation — unaffected tools continue working | R11.3, R14.1–14.2 |
| P11 | Secret Non-Exposure — no secrets in logs, outputs, or env dumps | R8.5–8.6 |
| P12 | Configuration Caching — single API call per key per process | R8.3 |
| P13 | Retry Exponential Backoff — 5s, 10s, 20s, max 60s | R14.4 |
Error Handling
| Scenario | Response | Recovery |
|---|---|---|
| Neptune unreachable | Graph tools degraded; filesystem + vector tools continue | CloudWatch alarm; exponential backoff retry |
| OpenSearch index missing | Empty results with warning; health reports degraded | Re-run migration for specific index |
| Unknown APOC procedure | `UnsupportedQueryError` with procedure name | Add to replacement map or implement Gremlin fallback |
| Secrets Manager throttled | Use cached secrets; fall back to env vars with warning | Secrets cached 5 min; VPC endpoint avoids internet |
| Migration partial failure | Idempotent re-execution from last watermark | Re-run script; verification asserts count parity |
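The degradation behavior in the first two rows can be illustrated with a small health-report sketch. It assumes each tool declares which database it depends on; the `ToolInfo` shape, the function name, and the tool names used with it are hypothetical.

```typescript
type Dependency = "graph" | "vector" | "none";

interface ToolInfo {
  name: string;
  dependsOn: Dependency;
}

interface HealthReport {
  status: "healthy" | "degraded";
  availableTools: string[];
}

// Reports overall status and the tools that can still serve requests when
// one backend is down.
function assessHealth(graphUp: boolean, vectorUp: boolean, tools: ToolInfo[]): HealthReport {
  const availableTools = tools
    .filter(function (tool) {
      if (tool.dependsOn === "graph") return graphUp;
      if (tool.dependsOn === "vector") return vectorUp;
      return true; // filesystem-style tools need neither database
    })
    .map(function (tool) { return tool.name; });
  return {
    status: graphUp && vectorUp ? "healthy" : "degraded",
    availableTools: availableTools,
  };
}
```

A tool that needs both databases would require a richer dependency model than this single-valued field.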
Implementation Plan
5 Sub-Phases (46A–46E), 17 Task Groups
Phase 46A: Foundation (Week 1–2)
- CDK project scaffolding (`MdcVpcStack`, `MdcSecurityStack`, `MdcDataStack`)
- VPC with public/private subnets, NAT Gateway, VPC endpoints
- Neptune cluster, OpenSearch domain, EFS, S3 staging bucket
- Secrets Manager + SSM Parameter Store entries
- `resolveConfig()` with caching and fallback
Phase 46B: MCP Server on ECS (Week 2–3)
- Database adapter interfaces and implementations (OpenSearch, Neptune, legacy wrappers)
- APOC transformation engine
- Backend selection and `UnifiedDataAccess` wiring
- Dockerfile, `MdcServerStack` (ECS, Fargate, ALB, API Gateway, CloudFront)
- Health check and error handling with graceful degradation
Phase 46C: Database Migration (Week 3–5)
- OpenSearch index creation (5 indices with k-NN mappings)
- Data migration script (Neo4j → Neptune, ChromaDB → OpenSearch)
- Migration verification (count parity)
- Search relevance validation (5% tolerance)
Phase 46D: Ingestion Pipeline Adaptation (Week 5–6)
- Adapt 7 ingestion scripts for Neptune (bolt/openCypher) and OpenSearch (bulk API)
- Preserve MPNet embedding model (768-dim)
Phase 46E: Validation & Cutover (Week 6–7)
- CloudWatch dashboards and alarms
- Full 51-tool integration tests against AWS backends
- MCP client configuration cutover
- Legacy system kept as read-only fallback for 2 weeks
Cost Estimate
| Service | Configuration | Monthly Cost |
|---|---|---|
| Neptune Serverless | 1–8 NCU, openCypher | ~$50–200 |
| OpenSearch Serverless | 2 OCU (search + index) | ~$350 |
| ECS Fargate | 1 vCPU, 2GB, min 1 task | ~$36 |
| CloudFront | Moderate traffic | ~$50 |
| ALB | 1 ALB | ~$17 |
| Cognito | <50K MAU (free tier) | $0 |
| Secrets Manager | 5 secrets | ~$2 |
| EFS | 10GB | ~$3 |
| NAT Gateway | 1 AZ | ~$37 |
| Total | | ~$545–695/month |
Performance Targets
| Metric | Legacy (Docker) | AWS Target |
|---|---|---|
| Vector query latency | ~50ms (ChromaDB local) | ~100–200ms (OpenSearch) |
| Graph query latency | ~20ms (Neo4j local) | ~50–100ms (Neptune) |
| MCP request E2E | ~200ms (stdio local) | ~500ms (HTTP + auth) |
| Startup time | ~5s (npm start) | ~30s (Fargate cold start) |
| Data migration | N/A | ~2–4 hours (one-time) |
Security Summary
- External access: OAuth 2.0 via Amazon Cognito, with Protected Resource Metadata per RFC 9728
- Internal (VPC): IAM roles for ECS tasks → Neptune, OpenSearch, Secrets Manager
- Neptune: IAM authentication (no username/password)
- Network: All databases in private subnets, VPC endpoints for AWS services
- Encryption: KMS at rest (Neptune, OpenSearch, EFS, S3), TLS 1.2+ in transit
- WAF: Rate limiting, geo-restriction, SQL injection protection on CloudFront
- No secrets in CDK outputs, environment variables, or container logs
Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| AWS CDK | v2.x | Infrastructure as code |
| `@opensearch-project/opensearch` | ^2.x | OpenSearch Node.js client |
| `neo4j-driver` | ^5.x | Neptune bolt protocol (openCypher) |
| `@aws-sdk/client-secrets-manager` | latest | Secrets retrieval |
| `@aws-sdk/client-ssm` | latest | Parameter Store access |
| `@modelcontextprotocol/sdk` | existing | MCP protocol (unchanged) |
| `sentence-transformers` | existing | MPNet embedding model (unchanged) |
| `fast-check` | ^3.x | Property-based testing |
Generated from Kiro spec artifacts in .kiro/specs/aws-infrastructure-port/ (requirements.md, design.md, tasks.md) on March 20, 2026.