Infrastructure & DevOps - sgajbi/portfolio-analytics-system GitHub Wiki
This phase focuses on productionizing the Portfolio Analytics System with containerization, continuous integration/deployment (CI/CD), observability, monitoring, and scalable deployments.
Recent improvements include standardized logging with correlation IDs and enhanced operational traceability across services.
- Dockerize all microservices with optimized, production-ready Dockerfiles.
- Set up a CI/CD pipeline for automated builds, tests, and deployments.
- Implement centralized logging with correlation IDs for distributed tracing.
- Implement correlation ID persistence in `processed_events` for operational debugging.
- Deploy services on Kubernetes with proper manifests and Helm charts.
- Enable autoscaling based on Kafka event load.
- Docker: Containerize services using multi-stage builds.
- CI/CD: GitHub Actions, GitLab CI, or Jenkins pipelines.
- Logging: Splunk or ELK stack (Elasticsearch, Logstash, Kibana), with all logs enriched with `correlation_id`.
- Monitoring: Prometheus and Grafana.
- Tracing: Correlation ID flow through Kafka headers and logs.
- Container Orchestration: Kubernetes (AKS, EKS, or GKE).
- Autoscaling: KEDA for Kafka event-driven scaling.
- Secrets Management: Vault or cloud provider secrets manager.
- Infrastructure as Code: Helm charts and Kubernetes manifests.
- Each service has a Dockerfile optimized for minimal image size and faster builds.
- Use environment variables and secrets injection for configuration.
- Automatically build Docker images on commit.
- Run unit and integration tests.
- Push images to container registry.
- Deploy to Kubernetes clusters using Helm or kubectl.
- Correlation ID Standardization:
  - All services generate or propagate `correlation_id` in the format `<svc-shortname>:<uuid>`.
  - Logged in every service with the pattern: `[LEVEL] [corr_id=ING:uuid] ServiceName - Message`
  - Correlation ID flows via Kafka headers and is included in API responses as `X-Correlation-ID`.
- Centralized Logging:
  - All logs are shipped to Splunk/ELK with the `correlation_id` field indexed.
  - Allows querying an event across multiple services by a single correlation ID.
- Debugging via Database:
  - The `processed_events` table stores the `correlation_id` for every processed event.
  - Ops can trace event status and logs via: `SELECT * FROM processed_events WHERE correlation_id='ING:uuid';`
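The correlation ID flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the helper names (`new_correlation_id`, `ensure_correlation_id`, `format_log_line`, `record_processed_event`), the Kafka header key `correlation_id`, and the `event_id` and `status` columns are all assumptions made for the example, and an in-memory SQLite database stands in for the real store.

```python
import sqlite3
import uuid

HEADER_KEY = "correlation_id"  # assumed Kafka header name, not confirmed by the source


def new_correlation_id(service_shortname: str) -> str:
    """Build an ID in the documented <svc-shortname>:<uuid> format."""
    return f"{service_shortname}:{uuid.uuid4()}"


def ensure_correlation_id(headers, service_shortname: str) -> str:
    """Propagate the ID found in Kafka-style headers, or generate a new one.

    Headers are modelled as a list of (str, bytes) tuples.
    """
    for key, value in headers:
        if key == HEADER_KEY and value:
            return value.decode("utf-8")
    return new_correlation_id(service_shortname)


def format_log_line(level, correlation_id, service_name, message) -> str:
    """Render the documented pattern: [LEVEL] [corr_id=...] ServiceName - Message."""
    return f"[{level}] [corr_id={correlation_id}] {service_name} - {message}"


def record_processed_event(conn, event_id, correlation_id, status) -> None:
    """Persist the ID next to the event (column names are illustrative)."""
    conn.execute(
        "INSERT INTO processed_events (event_id, correlation_id, status) "
        "VALUES (?, ?, ?)",
        (event_id, correlation_id, status),
    )


# In-memory SQLite stands in for the real database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_events (event_id TEXT, correlation_id TEXT, status TEXT)"
)

# The ingestion service starts the chain; a downstream service reuses the ID.
ingest_id = new_correlation_id("ING")
outgoing_headers = [(HEADER_KEY, ingest_id.encode("utf-8"))]
downstream_id = ensure_correlation_id(outgoing_headers, "CALC")

record_processed_event(conn, "evt-1", downstream_id, "PROCESSED")
log_line = format_log_line(
    "INFO", downstream_id, "CalculatorService", "event processed"
)

# Ops-style lookup by correlation ID, parameterized rather than string-built.
rows = conn.execute(
    "SELECT event_id, status FROM processed_events WHERE correlation_id = ?",
    (ingest_id,),
).fetchall()
```

Because the downstream service reuses the incoming ID instead of minting its own, the same value appears in both services' log lines and in `processed_events`, which is what makes a single-ID search across services possible.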
- Collect service metrics with Prometheus exporters.
- Configure Grafana dashboards for:
  - Event processing latency.
  - Kafka topic lag.
  - Error rate by service.
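To make the three dashboard quantities concrete, the sketch below shows the arithmetic behind them in plain Python. In a real deployment these values would normally come from Prometheus metrics and queries rather than hand-rolled functions; the function names, the nearest-rank percentile method, and the choice to treat a missing committed offset as 0 are assumptions for illustration only.

```python
import math


def consumer_lag(end_offsets: dict, committed_offsets: dict) -> int:
    """Total lag: sum over partitions of (log end offset - committed offset).

    A partition with no committed offset is treated as committed at 0
    (an assumption of this sketch).
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def error_rate(error_count: int, total_count: int) -> float:
    """Fraction of events that failed; 0.0 when nothing was processed."""
    return error_count / total_count if total_count else 0.0


def percentile(values, pct: float):
    """Nearest-rank percentile, e.g. pct=95 for p95 processing latency."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Example inputs: two partitions, 200 events with 5 failures, ten latencies (ms).
lag = consumer_lag({0: 120, 1: 80}, {0: 100, 1: 80})
rate = error_rate(5, 200)
p95_ms = percentile([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], 95)
```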
- Define manifests for Deployments, Services, ConfigMaps, and Secrets.
- Use Helm charts for templated deployments.
- Configure Kafka consumer autoscaling with KEDA based on consumer group lag.
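As a rough mental model of lag-based scaling, the replica count grows with total lag divided by a per-replica lag threshold, bounded by the configured minimum and maximum. The Python sketch below approximates that calculation; it is not KEDA's implementation. The function `desired_replicas` is hypothetical, its parameters loosely mirror KEDA's `lagThreshold`, `minReplicaCount`, and `maxReplicaCount` settings, and the optional cap at the partition count reflects the fact that consumers beyond the number of partitions in a group would sit idle.

```python
import math


def desired_replicas(total_lag, lag_threshold, min_replicas, max_replicas,
                     partitions=None):
    """Approximate replica count for a lag-threshold trigger (illustrative only)."""
    if lag_threshold <= 0:
        raise ValueError("lag_threshold must be positive")
    # One replica per `lag_threshold` messages of outstanding lag, rounded up.
    wanted = math.ceil(total_lag / lag_threshold)
    # Extra consumers beyond the partition count would receive no partitions.
    if partitions is not None:
        wanted = min(wanted, partitions)
    # Clamp to the configured bounds.
    return max(min_replicas, min(wanted, max_replicas))


moderate = desired_replicas(250, 100, min_replicas=1, max_replicas=10, partitions=6)
burst = desired_replicas(5000, 100, min_replicas=1, max_replicas=10, partitions=6)
idle = desired_replicas(0, 100, min_replicas=1, max_replicas=10, partitions=6)
```

With these inputs, a lag of 250 asks for three replicas, a burst of 5000 is capped at the six partitions, and an idle topic falls back to the minimum of one.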
- Load testing of the entire analytics pipeline.
- Failover and recovery drills.
- Monitoring alert rules for SLA adherence.
- Validation of correlation ID propagation and log visibility in Splunk/ELK.