Playbook 8: SAFe DevOps Playbook
SAFe DevOps Playbook - [Your Organization/Team Name]
Chapter 1: Introduction
Welcome to the SAFe DevOps Playbook for [Your Organization/Team Name]. This document serves as a practical guide for implementing and improving our DevOps practices within the Scaled Agile Framework (SAFe).
1.1 Purpose
The primary purpose of this playbook is to:
- Establish Consistency: Provide a shared understanding and consistent approach to DevOps practices across our teams working on [Specific Value Stream(s)/Product(s)/Team(s)].
- Enable Flow: Outline the processes, tools, and standards that support our Continuous Delivery Pipeline, enabling faster and more reliable value delivery.
- Improve Collaboration: Foster a culture of shared responsibility and collaboration between Development, Operations, Security, and the Business.
- Accelerate Onboarding: Serve as a key resource for new team members to quickly understand our ways of working.
- Drive Continuous Improvement: Act as a living document that captures our current state and supports ongoing refinement of our DevOps capabilities.
1.2 Scope
This playbook applies to all members of the Agile Release Trains (ARTs) and teams involved in the development, testing, deployment, and operation of [Specify the systems, applications, products, or value streams covered by this playbook, e.g., the 'Customer Self-Service Portal' Value Stream]. It covers the entire lifecycle from Continuous Exploration through Release on Demand.
While the principles are broadly applicable, specific toolchains and detailed procedures may initially focus on [Mention specific teams or technology stacks if rolling out incrementally].
1.3 Audience
This playbook is intended for everyone involved in delivering value through our Continuous Delivery Pipeline, including but not limited to:
- Developers
- QA Engineers / Testers
- Operations Engineers / Site Reliability Engineers (SREs)
- Security Engineers (SecOps)
- System Architects
- Product Owners
- Product Management
- Scrum Masters / Team Coaches
- Release Train Engineers (RTEs)
- Business Owners & Stakeholders
1.4 How to Use This Playbook
- As a Guide: Refer to this playbook for guidance on standard practices, tool usage, and process steps.
- For Onboarding: New team members should review relevant sections to understand our DevOps culture and pipeline.
- For Improvement: Use this playbook as a baseline during retrospectives and improvement workshops (like Inspect & Adapt) to identify areas for enhancement.
- As a Living Document: This is not a static document. Expect it to evolve as we learn and improve. Contribute to its evolution!
Navigate the playbook using the table of contents. Focus on the sections most relevant to your role and current tasks.
1.5 Playbook Ownership & Updates
This playbook is owned and maintained by [Specify Owner Team/Role, e.g., the System Team, a DevOps Community of Practice, the RTE].
It will be reviewed and updated on a [Specify Cadence, e.g., quarterly, per Program Increment (PI)] basis, or more frequently as significant changes occur in our processes or tooling.
Suggestions for improvements or corrections can be submitted via [Specify Mechanism, e.g., a dedicated Slack channel #devops-playbook-feedback, a backlog item in Jira/Azure DevOps, contacting the owning team].
Chapter 2: Our SAFe DevOps Approach
This chapter outlines the specific vision, goals, and principles that guide our adoption of SAFe DevOps practices at [Your Organization/Team Name].
2.1 Quick Recap: What is SAFe DevOps?
As introduced in the SAFe framework, DevOps is a combination of mindset, culture, and technical practices that enables collaboration across the entire value stream. For us, it means breaking down silos between Development, Operations, Security, and the Business to build, test, deploy, and release software faster, more frequently, and more reliably. It's the engine driving our Continuous Delivery Pipeline.
2.2 Our Vision & Goals
Our Vision: To create a seamless flow of value from idea to production, enabling rapid delivery of innovative and high-quality solutions that delight our customers and meet business objectives. [Optionally add a link to a more detailed Vision document or presentation: Link to Vision Document]
Our Key Goals: We aim to achieve the following through our DevOps practices:
- Speed: [Quantifiable goal, e.g., Reduce average cycle time for features from X days to Y days by end of Q4]
- Quality: [Quantifiable goal, e.g., Reduce change failure rate from X% to Y%]
- Reliability: [Quantifiable goal, e.g., Improve Mean Time To Restore (MTTR) from X hours to Y minutes]
- Efficiency: [Quantifiable goal, e.g., Increase deployment frequency from monthly to weekly/daily]
- Culture: [Qualitative goal, e.g., Foster a stronger culture of shared ownership and continuous learning across teams]
2.3 Core Principles (CALMR in Our Context)
We embrace the CALMR principles as the foundation of our DevOps approach:
- Culture: We foster a culture of shared responsibility, transparency, and psychological safety. We encourage cross-functional collaboration from the start of feature definition through to production support. We practice blameless post-mortems to learn from failures and promote continuous learning through communities of practice and knowledge sharing. [Mention specific forums like CoPs, shared Slack channels, etc.]
- Automation: We automate relentlessly throughout the Continuous Delivery Pipeline. Our strategy focuses on automating builds, testing (unit, integration, performance, security), deployment, and infrastructure provisioning. Key areas for automation include [List specific priorities, e.g., test data generation, environment setup, compliance checks]. Our target is [e.g., >95% automation of the build-test-deploy process].
- Lean Flow: We apply Lean principles to maximize value delivery and minimize waste. This includes visualizing our workflow (e.g., using [Tool like Jira/Azure Boards]), managing Work-in-Progress (WIP) limits at each stage, aiming for smaller batch sizes, and actively identifying and addressing bottlenecks in the pipeline through value stream mapping and regular reviews.
- Measurement: We measure what matters to understand our performance and drive improvement. Key metrics (detailed in Chapter 6) include DORA metrics (Deployment Frequency, Lead Time for Changes, Change Failure Rate, MTTR), as well as metrics related to flow, quality, and system performance. We use [Dashboarding tool like Grafana/Datadog/Azure Monitor] for visibility.
- Recovery: We design for resilience. Our approach includes automating recovery processes, implementing robust monitoring and alerting, planning for disaster recovery, and using techniques like feature toggles and canary releases to minimize the impact of failures. We conduct regular [e.g., Chaos Engineering experiments, DR tests] to validate our recovery capabilities.
Chapter 3: Our Continuous Delivery Pipeline (CDP)
This chapter details the stages, activities, tools, and practices that constitute our specific Continuous Delivery Pipeline (CDP) at [Your Organization/Team Name].
3.1 Overview
Our CDP represents the workflow, activities, and automation needed to shepherd a new piece of functionality from idea to an on-demand release of value to the end user.
[**Action:** Insert a high-level diagram/visualization of YOUR specific Continuous Delivery Pipeline here. Show the flow through CE, CI, CD, and RoD, highlighting key stages and feedback loops.]
Example Flow: Idea -> Backlog -> CE (Analyze, Refine) -> Implement -> Commit -> CI (Build, Test, Integrate) -> CD (Deploy to Staging/Prod-like, Test) -> Production -> RoD (Verify, Release, Monitor) -> Feedback
3.2 Continuous Exploration (CE)
CE focuses on understanding market/customer needs and defining solutions.
- Process: Ideas are captured in [Tool, e.g., Aha!, Jira Product Discovery], prioritized during [Event, e.g., PI Planning, Portfolio Sync], analyzed, and refined into Features/Stories in [Backlog Tool, e.g., Jira, Azure DevOps Boards]. We use techniques like [e.g., Hypothesis testing, A/B testing design, Solution workshops].
- Tools: [e.g., Jira, Confluence, Miro, Aha!, Figma]
- Collaboration: Product Management, System Architects, Business Owners, and Agile Teams collaborate closely during PI Planning, backlog refinement sessions, and system demos. Definition of Ready (DoR) criteria ensure features are well-understood before implementation.
3.3 Continuous Integration (CI)
CI is the practice of frequently integrating code changes into a central repository, after which automated builds and tests are run.
- Source Code Management (SCM): We use [SCM Tool, e.g., Git via GitLab/GitHub/Bitbucket]. Our standard branching strategy is [e.g., Trunk-Based Development, GitFlow - specify key branch details like main/master, develop, feature/*, release/*, hotfix/*]. Commit messages must follow [Link to Commit Message Convention, e.g., Conventional Commits].
- Build Process: Automated builds are triggered on every commit to [Specify branches, e.g., main/develop and feature branches] using [CI Server, e.g., Jenkins, GitLab CI, Azure Pipelines, GitHub Actions]. Builds produce [e.g., Docker containers, JAR files] stored in [Artifact Repository, e.g., Nexus, Artifactory, GitLab Package Registry].
- Testing Strategy:
- Unit Tests: Written using [Frameworks, e.g., JUnit, NUnit, Jest]. Target coverage: [e.g., >80%]. Run automatically on commit.
- Component/Integration Tests: Written using [Frameworks, e.g., Postman/Newman, RestAssured, Selenium]. Run automatically after successful build/unit tests.
- Automated Acceptance Tests (AAT): [Frameworks, e.g., Cucumber, SpecFlow, Cypress]. Run in CI pipeline against deployed components in a test environment.
- Code Quality: We enforce code quality standards using [Static Analysis Tool, e.g., SonarQube, Checkstyle] integrated into the CI pipeline. Builds fail if quality gates [Link to Quality Gate criteria] are not met.
3.4 Continuous Deployment (CD)
CD automates the deployment of validated changes to staging and production environments.
- Environment Strategy: We maintain the following environments: [List environments, e.g., Development, Integration Test, Staging/Pre-Prod, Production]. Environments are managed using [e.g., Kubernetes, VMs, Cloud PaaS like Azure App Service/AWS Elastic Beanstalk].
- Deployment Automation: Deployments are automated using scripts/pipelines defined in [CI/CD Tool, e.g., Jenkins, GitLab CI, Azure Pipelines, Argo CD, Spinnaker]. We use [Secrets Management Tool, e.g., HashiCorp Vault, Azure Key Vault, AWS Secrets Manager] for managing sensitive information.
- Deployment Patterns: We primarily use [e.g., Blue-Green deployments, Canary releases, Rolling updates] for production deployments to minimize risk and downtime; a blue-green sketch appears at the end of this section.
- Infrastructure as Code (IaC): We define and manage our infrastructure using [IaC Tools, e.g., Terraform, Ansible, Pulumi, ARM Templates, CloudFormation]. IaC code is stored in [Version Control System] and versioned alongside application code where applicable. Environment provisioning is [e.g., fully automated, semi-automated].
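To make the blue-green pattern concrete, here is a minimal Python sketch. It is illustrative only: the URLs and the `/health` endpoint are assumptions, and in practice the cutover happens in the load balancer or service mesh configuration, not in application code.

```python
# Minimal blue-green switch sketch (illustrative; real cutovers happen at the
# load balancer / service mesh, not in application code).
import urllib.request

ENVIRONMENTS = {  # hypothetical internal URLs
    "blue": "https://blue.internal.example.com",
    "green": "https://green.internal.example.com",
}
active = "blue"  # in practice this pointer lives in the router/LB config


def healthy(base_url: str) -> bool:
    """Probe a conventional /health endpoint on the candidate environment."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError/HTTPError and network failures
        return False


def switch_traffic() -> str:
    """The new version has already been deployed to the idle colour;
    flip the active pointer only if the idle environment is healthy."""
    global active
    idle = "green" if active == "blue" else "blue"
    if healthy(ENVIRONMENTS[idle]):
        active = idle  # instant cutover; old colour stays warm for rollback
    return active
```

The key property is that the idle environment is fully deployed and verified before any traffic reaches it, and the previous environment stays warm as an instant rollback target.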
3.5 Release on Demand (RoD)
RoD separates the deployment process from the release of value, allowing the business to decide when features are made available to end users.
- Release Strategy: Features deployed to production are often initially hidden or exposed to a small subset of users. The decision to release broadly is made by [Role/Group, e.g., Product Management, Business Owners] based on [Criteria, e.g., PI objectives, market events, system demo feedback]. Our target release cadence is [e.g., per PI, monthly, weekly, on demand].
- Techniques: We use techniques like:
- Feature Toggles/Flags: Managed using [Tool/Library, e.g., LaunchDarkly, Unleash, custom solution]; see the sketch at the end of this section.
- Canary Releases: Exposing new versions to a small percentage of users first.
- Dark Launches: Deploying backend changes without exposing them via the UI.
- Approval & Verification: The release process involves [Describe steps, e.g., final validation in production, go/no-go decision meeting, communication plan]. Post-release monitoring confirms stability and expected business value.
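To illustrate how a feature toggle decouples deployment from release, here is a minimal Python sketch of the "custom solution" option; the flag names are hypothetical, and dedicated tools such as LaunchDarkly or Unleash add percentage rollouts, user targeting, and audit trails on top of this basic idea.

```python
import json

# Flags default to "off": code can be deployed dark and released later by
# flipping configuration, without a redeploy. Flag names are hypothetical.
FLAGS = json.loads('{"new_checkout": false, "search_v2": false}')


def is_enabled(flag: str) -> bool:
    return bool(FLAGS.get(flag, False))  # unknown flags stay off


# In application code, deployment and release are now separate decisions:
if is_enabled("new_checkout"):
    pass  # new code path, released on demand by the business
else:
    pass  # existing behaviour keeps serving everyone else
```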
Chapter 4: Key Practices, Standards & Tools
This chapter provides specific details on the core technical practices, standards, and tools we utilize to enable our Continuous Delivery Pipeline. Adhering to these standards is crucial for consistency, quality, and efficiency.
4.1 Version Control
- System: [e.g., Git]
- Platform: [e.g., GitLab, GitHub Enterprise, Azure Repos, Bitbucket]
- Branching Strategy: We follow the [e.g., Trunk-Based Development with short-lived feature branches, GitFlow] model.
- [Describe key branches: e.g., 'main'/'master' is always production-ready; 'develop' is the integration branch; feature branches use 'feature/...' naming]
- [Link to detailed branching strategy documentation/diagram]
- Commit Messages: Must adhere to the [e.g., Conventional Commits standard, link to standard] format to enable automated changelog generation and semantic versioning (a validation sketch follows at the end of this section).
- Code Reviews: Pull/Merge requests are required for changes to [Specify protected branches, e.g., main, develop]. Reviews require [e.g., at least one approval] from team members. Focus areas include [e.g., correctness, adherence to standards, security, test coverage].
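As an illustration of the Conventional Commits format referenced above (assuming that is the standard chosen), the following Python sketch validates a commit header and derives the semantic-version bump it implies; the type list and bump rules follow the public Conventional Commits specification.

```python
import re

# Conventional Commit header: type(optional scope)(optional !): description
PATTERN = re.compile(
    r"^(?P<type>feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert)"
    r"(\((?P<scope>[\w-]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)


def bump_level(message: str) -> str:
    """Map a commit message to the semantic-version bump it implies."""
    m = PATTERN.match(message.splitlines()[0])
    if not m:
        raise ValueError(f"not a conventional commit: {message!r}")
    if m.group("breaking") or "BREAKING CHANGE:" in message:
        return "major"
    if m.group("type") == "feat":
        return "minor"
    return "patch"


assert bump_level("feat(auth): add OAuth2 login") == "minor"
assert bump_level("fix: handle expired tokens") == "patch"
assert bump_level("refactor!: drop legacy API") == "major"
```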
4.2 Build & Integration
- CI/CD Platform: [e.g., Jenkins, GitLab CI/CD, Azure Pipelines, GitHub Actions]
- Build Automation: Builds are defined [e.g., in Jenkinsfiles, .gitlab-ci.yml, azure-pipelines.yml, GitHub Actions workflows]. Scripts use [e.g., Maven, Gradle, npm, Make].
- Artifact Management: Build artifacts ([e.g., JARs, Docker Images, npm packages]) are versioned and stored in [Artifact Repository, e.g., Nexus Repository, JFrog Artifactory, GitLab Package Registry, Azure Artifacts].
- Pipeline Triggers: Builds are automatically triggered [e.g., on commit to any branch, on merge to develop/main].
4.3 Test Automation
- Unit Testing:
- Frameworks: [e.g., JUnit (Java), PyTest (Python), Jest (JS), NUnit (.NET)]
- Standard: Must cover core logic. Minimum coverage target: [e.g., 80% line coverage]. Checked via [e.g., SonarQube quality gate]. An illustrative test appears at the end of this section.
- Integration Testing:
- Frameworks: [e.g., RestAssured, Postman/Newman (API), Spring Boot Test]
- Standard: Validate interactions between components/services. Run automatically after unit tests in CI.
- Acceptance Testing (Behavior-Driven Development - BDD):
- Frameworks: [e.g., Cucumber (Java), SpecFlow (.NET), Behave (Python), Cypress (JS E2E)]
- Standard: Tests written in Gherkin syntax based on acceptance criteria. Run against deployed application in [e.g., Test environment].
- Performance Testing:
- Tools: [e.g., JMeter, k6, Gatling]
- Standard: Run [e.g., nightly, before major releases] against [e.g., Staging environment]. Key scenarios and KPIs defined in [Link to Performance Test Plan].
- Test Data Management: Test data is managed using [e.g., scripts, dedicated tools, database snapshots]. Sensitive data must be masked/anonymized.
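For teams on a Python stack (PyTest is one of the framework options above), a unit test in this style might look like the sketch below; `apply_discount` is a hypothetical function standing in for real production code, which would normally be imported from the application package.

```python
# test_pricing.py -- illustrative unit test in the PyTest style.
import pytest


def apply_discount(price: float, percent: float) -> float:
    """Stand-in for production code; normally imported, not defined here."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def test_apply_discount_happy_path():
    assert apply_discount(100.0, 15) == 85.0


def test_apply_discount_zero_percent_is_identity():
    assert apply_discount(42.0, 0) == 42.0


@pytest.mark.parametrize("bad_percent", [-1, 101])
def test_apply_discount_rejects_out_of_range(bad_percent):
    with pytest.raises(ValueError):
        apply_discount(100.0, bad_percent)
```

Coverage against the stated target can then be measured with the pytest-cov plugin (`pytest --cov`) and enforced by the quality gate.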
4.4 Deployment Automation
- Deployment Tools: [e.g., Argo CD, Spinnaker, Azure Pipelines Releases, Jenkins, Octopus Deploy]
- Target Platforms: [e.g., Kubernetes, Azure App Service, AWS ECS/EKS, VMs]
- Deployment Scripts: Defined using [e.g., Helm charts, Kustomize, Terraform, Ansible playbooks, pipeline YAML].
- Secrets Management: Secrets (API keys, passwords, certificates) are managed securely using [Secrets Management Tool, e.g., HashiCorp Vault, Azure Key Vault, AWS Secrets Manager, Kubernetes Secrets] and injected at runtime. Never commit secrets to version control.
- Zero-Downtime Deployments: Achieved using [e.g., Blue-Green deployments, Canary releases, Rolling updates] patterns, configured within [Deployment Tool].
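The sketch below shows the core idea behind a canary split: deterministic, sticky bucketing of users so that a fixed percentage consistently sees the new version. This is conceptual only; in practice the traffic split is configured in the deployment tool or service mesh rather than hand-rolled.

```python
import hashlib

CANARY_PERCENT = 5  # start small, widen as confidence grows


def serves_canary(user_id: str, percent: int = CANARY_PERCENT) -> bool:
    """Deterministically bucket users (0..65535) so each one sees the same
    version for the whole canary period (sticky assignment)."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]
    return bucket < (percent / 100) * 65536


version = "v2-canary" if serves_canary("user-1234") else "v1-stable"
```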
4.5 Monitoring & Logging
- Metrics Collection: [e.g., Prometheus client libraries, OpenTelemetry SDKs, CloudWatch Agent, Azure Monitor Agent] used for application metrics. Infrastructure metrics collected via [e.g., node-exporter, cloud provider services].
- Metrics Storage & Visualization: [e.g., Prometheus + Grafana, Datadog, Azure Monitor Metrics + Workbooks, AWS CloudWatch]
- Logging Aggregation: Applications log to [e.g., stdout/stderr (for containers), files]. Logs are collected using [e.g., Fluentd, Logstash, Promtail] and aggregated in [Log Management System, e.g., Elasticsearch (ELK Stack), Loki, Datadog Logs, Azure Log Analytics, AWS CloudWatch Logs].
- Log Format: Logs should follow [e.g., JSON format, specific standard fields like timestamp, level, correlation ID]; a standard-library sketch appears at the end of this section. [Link to Logging Standard]
- Alerting: Alerts are configured in [Alerting Tool, e.g., Alertmanager, Grafana Alerting, Datadog Monitors, Azure Monitor Alerts] based on metrics and logs. Key alerts trigger notifications to [Notification Channel, e.g., PagerDuty, Opsgenie, Slack channel #alerts].
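As a minimal sketch of the structured log format described above (standard-library only, so no third-party API is assumed), the formatter below emits one JSON object per line carrying the timestamp, level, and correlation-ID fields named in the Log Format bullet:

```python
import json
import logging
import sys
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the standard fields named above."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # ties all log lines for one request together across services
            "correlation_id": getattr(record, "correlation_id", None),
        })


handler = logging.StreamHandler(sys.stdout)  # stdout suits containerized apps
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The caller supplies the correlation ID (e.g., from an incoming header).
log.info("order placed", extra={"correlation_id": str(uuid.uuid4())})
```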
4.6 Security (DevSecOps)
- Static Application Security Testing (SAST): Integrated into the CI pipeline using [SAST Tool, e.g., SonarQube Security Hotspots, Checkmarx, Snyk Code]. Blocks the build on critical/high vulnerabilities (see the gate sketch at the end of this section).
- Software Composition Analysis (SCA): Scan for vulnerable dependencies using [SCA Tool, e.g., OWASP Dependency-Check, Snyk Open Source, Nexus Lifecycle]. Integrated into CI.
- Dynamic Application Security Testing (DAST): Run [e.g., periodically, before release] against deployed applications in [e.g., Staging environment] using [DAST Tool, e.g., OWASP ZAP, Burp Suite Enterprise].
- Container Security: Scan container images for vulnerabilities using [Container Scanning Tool, e.g., Trivy, Clair, Snyk Container, Prisma Cloud]. Integrated into CI/CD and registry.
- Secrets Detection: Scan code repositories for accidentally committed secrets using [Secrets Scanning Tool, e.g., git-secrets, TruffleHog, Gitleaks]. Run [e.g., pre-commit hook, in CI pipeline].
- Security Standards: Adhere to [Link to Org Security Policies, e.g., OWASP Top 10 mitigation guidelines, CIS Benchmarks]. Security reviews conducted [e.g., as part of design, pre-release].
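The following tool-agnostic Python sketch shows the "block the build" idea from the SAST/SCA bullets: parse a findings report and exit nonzero when critical or high severities are present. The JSON report format here is hypothetical; real scanners ship their own schemas and usually offer built-in severity thresholds that make a script like this unnecessary.

```python
# ci_security_gate.py -- fail the pipeline when blocking findings exist.
# The report format is hypothetical; adapt to your scanner's actual schema.
import json
import sys

BLOCKING = {"critical", "high"}


def gate(report_path: str) -> int:
    with open(report_path) as fh:
        findings = json.load(fh)  # e.g. [{"id": "...", "severity": "high"}]
    blocking = [x for x in findings
                if x.get("severity", "").lower() in BLOCKING]
    for finding in blocking:
        print(f"BLOCKING: {finding.get('id')} "
              f"severity={finding.get('severity')}")
    return 1 if blocking else 0  # nonzero exit code fails the CI job


if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```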
4.7 Infrastructure & Environment Management
- Infrastructure as Code (IaC): Infrastructure is defined using [IaC Tool, e.g., Terraform, Ansible, Pulumi, CloudFormation, ARM Templates]. Code stored in [Link to IaC Repositories]. A minimal example appears at the end of this section.
- Environment Provisioning: Environments ([e.g., Dev, Test, Staging]) are provisioned/updated [e.g., automatically via pipelines, on demand using scripts].
- Configuration Management: Ensuring consistency across environments using [e.g., Ansible, Chef, Puppet, Kubernetes ConfigMaps].
- Environment Parity: Strive for maximum parity between Staging and Production environments regarding infrastructure, OS, dependencies, and configurations.
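If Pulumi (one of the IaC options above) were the chosen tool, a minimal Python program might look like the sketch below; the resource and tag names are hypothetical, and the same declarative idea applies to Terraform, CloudFormation, or ARM templates. Running `pulumi up` diffs the declared state against reality and converges the infrastructure toward it.

```python
# __main__.py -- minimal Pulumi program (Python); names are hypothetical.
import pulumi
from pulumi_aws import s3

# Declarative resource: the bucket's desired state lives in version control,
# alongside application code where applicable.
artifacts = s3.Bucket(
    "build-artifacts",
    tags={"owner": "platform-team", "environment": "staging"},
)

pulumi.export("artifacts_bucket", artifacts.id)
```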
Chapter 5: Roles & Responsibilities
While DevOps emphasizes a culture of shared responsibility, understanding how different roles contribute to the value stream is essential for effective collaboration. This chapter outlines key responsibilities within our SAFe DevOps context, focusing on collaboration and breaking down traditional silos.
Core Principle: Everyone involved in building and delivering a solution shares responsibility for its success throughout its lifecycle. "It works on my machine" is not acceptable; "We ensure it works in production" is our goal.
Key Role Contributions (Examples - Adapt to your structure):
- Agile Team (Developers, Testers):
- Develop high-quality code and automated tests (unit, component, integration).
- Integrate code frequently (multiple times per day).
- Define and manage CI/CD pipeline jobs relevant to their application/service.
- Collaborate on defining acceptance criteria and test automation (BDD).
- Incorporate security best practices into code (working with SecOps).
- Ensure applications are observable (logging, metrics).
- Participate in troubleshooting and fixing issues across all environments, including production.
- Contribute to IaC for application-specific infrastructure.
- QA Engineers / Testers:
- Champion quality throughout the lifecycle.
- Develop and maintain automated test frameworks (Acceptance, E2E, Performance).
- Collaborate with POs and Devs on acceptance criteria and test scenarios.
- Analyze test results and provide fast feedback.
- Promote and guide testing best practices within the team.
- Conduct exploratory testing where appropriate.
- Operations Engineers / Site Reliability Engineers (SREs):
- Design, build, and maintain reliable and scalable infrastructure (often using IaC).
- Manage and operate CI/CD tooling, artifact repositories, monitoring/logging systems.
- Define and manage deployment strategies (Blue/Green, Canary).
- Ensure operational readiness (monitoring, alerting, runbooks).
- Lead incident response and conduct blameless post-mortems.
- Define and track Service Level Objectives (SLOs).
- Collaborate with development teams on operability, performance, and reliability.
- Automate operational tasks.
- Security Engineers (SecOps):
- Integrate security tools and practices into the CI/CD pipeline (SAST, DAST, SCA).
- Define security policies and standards (working with Architects).
- Perform security reviews, threat modeling, and penetration testing.
- Consult with teams on secure coding practices and vulnerability remediation.
- Monitor for and respond to security incidents.
- System Architects:
- Define and evolve the overall system architecture, ensuring it supports DevOps goals (e.g., loosely coupled services, deployability).
- Set technical standards and guidelines (including for security, IaC, monitoring).
- Evaluate and select tools and technologies.
- Ensure architectural runway enables continuous delivery.
- Collaborate with teams on solution design.
- Product Owner (PO) / Product Management (PM):
- Collaborate closely with the team during CE to define features and acceptance criteria.
- Prioritize the backlog considering technical enablers and NFRs needed for DevOps.
- Participate in release planning (RoD) and decisions.
- Provide fast feedback on delivered functionality.
- Scrum Master / Team Coach:
- Coach the team on Agile and DevOps principles and practices.
- Facilitate collaboration and communication within the team and with stakeholders.
- Help identify and remove impediments in the workflow.
- Promote a culture of continuous improvement and psychological safety.
- Release Train Engineer (RTE):
- Facilitate ART events and processes, ensuring alignment and flow across teams.
- Help manage risks and dependencies related to the CDP.
- Drive relentless improvement at the ART level, including DevOps practices.
- Ensure tooling and infrastructure support the ART's delivery needs.
Note: Specific responsibilities may vary based on team structure ([e.g., presence of a dedicated Platform team, SRE model adopted]) and individual skillsets. The emphasis is always on collaboration and shared goals.
Chapter 6: Measurement & Continuous Improvement
Effective DevOps requires continuous learning and adaptation. Measurement provides the necessary insights to understand our performance, identify bottlenecks, and guide our improvement efforts. "You can't improve what you don't measure."
6.1 Key Metrics
We focus on a balanced set of metrics that provide insights into speed, stability, quality, and flow. Our primary metrics include the industry-standard DORA metrics (a calculation sketch follows the list below):
- Deployment Frequency: How often do we successfully deploy code to production?
- Measure: Number of deployments to production per [time period, e.g., day, week].
- Tool: Measured via [e.g., CI/CD tool deployment logs, custom scripts].
- Goal: [e.g., Increase to daily deployments per service, deploy on demand].
- Lead Time for Changes: How long does it take to get committed code successfully running in production?
- Measure: Median time from code commit to production deployment.
- Tool: Calculated using data from [e.g., Git history and CI/CD deployment timestamps].
- Goal: [e.g., Reduce median lead time to < 1 day].
- Change Failure Rate: What percentage of deployments to production result in degraded service or require remediation (e.g., rollback, hotfix)?
- Measure: (Number of failed deployments / Total number of deployments) * 100%. Failure identified via [e.g., monitoring alerts, automated post-deployment tests, incidents].
- Tool: Tracked via [e.g., incident reports linked to deployments, CI/CD tool data].
- Goal: [e.g., Maintain < 15% failure rate].
- Time to Restore Service (MTTR): How long does it typically take to recover from a failure in production?
- Measure: Median time from when a failure occurs (or is detected) until service is restored.
- Tool: Measured via [e.g., incident management system timestamps, monitoring alert durations].
- Goal: [e.g., Reduce median MTTR to < 1 hour].
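To make the four DORA definitions concrete, here is a minimal calculation sketch over hypothetical records; in a real pipeline these values would come from the CI/CD, Git, and incident-management sources named for each metric above.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records; real data comes from CI/CD deployment logs,
# Git history, and the incident-management system.
deployments = [
    # (commit_time, deploy_time, failed?)
    (datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 14, 0), False),
    (datetime(2024, 5, 2, 10, 0), datetime(2024, 5, 3, 11, 0), True),
    (datetime(2024, 5, 3, 8, 0), datetime(2024, 5, 3, 16, 0), False),
    (datetime(2024, 5, 6, 9, 0), datetime(2024, 5, 6, 12, 0), False),
]
incidents = [timedelta(minutes=42), timedelta(hours=2)]  # detect -> restored
days_observed = 7

# Deployment Frequency: production deployments per day.
deployment_frequency = len(deployments) / days_observed

# Lead Time for Changes: median commit -> production time.
lead_time_for_changes = median(
    deploy - commit for commit, deploy, _ in deployments)

# Change Failure Rate: failed deployments / total deployments * 100%.
failures = sum(1 for *_, failed in deployments if failed)
change_failure_rate = failures / len(deployments) * 100

# Time to Restore Service: median incident duration.
mttr = median(incidents)

print(deployment_frequency, lead_time_for_changes,
      change_failure_rate, mttr)
```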
Other Key Metrics We Track:
- Flow Metrics: [e.g., Work in Progress (WIP), Cycle Time, Throughput - specify how/where these are tracked, e.g., Jira/Azure Boards dashboards].
- Quality Metrics: [e.g., Automated test coverage %, Defect escape rate, Security vulnerability counts/severity].
- System Performance & Availability: [e.g., Application response times, error rates, uptime %, SLO adherence].
- Value Metrics: [e.g., Feature usage, customer satisfaction (CSAT/NPS), business impact - where applicable].
[Link to detailed metric definitions and calculation methods]
6.2 Dashboards & Visibility
Metrics are most effective when visible and accessible to everyone.
- Primary Dashboard: Key metrics (especially DORA) are displayed on [Dashboard Tool, e.g., Grafana, Datadog, Azure Dashboards, custom dashboard].
- [Link to Primary DevOps Dashboard]
- Team-Level Dashboards: Teams may maintain more detailed dashboards relevant to their specific service or context in [Tool(s)].
- Review: Metrics are reviewed regularly during [e.g., Daily Stand-ups, Iteration Reviews, System Demos, specific DevOps sync meetings].
6.3 Feedback Loops
We rely on multiple feedback loops to gather qualitative and quantitative data for improvement:
- Pipeline Feedback: Fast feedback from automated tests, quality gates, and deployment status within the CI/CD pipeline.
- Monitoring & Alerting: Real-time feedback on system health and performance from production.
- Team Retrospectives: Regular reflection on team processes, practices, and collaboration during Iteration Retrospectives.
- System Demos: Feedback from stakeholders on the delivered functionality and value.
- Inspect & Adapt (I&A) Workshop: ART-level reflection on the PI, including quantitative measurement and problem-solving workshops.
- User Feedback: Direct feedback from end-users via [e.g., support channels, surveys, usability testing].
6.4 Improvement Cadence
Continuous improvement is an ongoing activity, integrated into our regular SAFe cadences:
- Iteration Level: Teams discuss pipeline performance, flow issues, and potential improvements during Iteration Retrospectives and plan improvement items for the next iteration.
- Program Increment (PI) Level: During the Inspect & Adapt (I&A) workshop, the ART reviews PI-level metrics (including DevOps metrics), identifies systemic problems, and defines improvement backlog items for the next PI.
- DevOps Community of Practice (CoP): [If applicable] Our DevOps CoP meets [Cadence, e.g., monthly] to share learnings, discuss challenges, propose standard improvements, and update this playbook. [Link to CoP page/channel]
- Targeted Improvements: Specific improvement initiatives may be tracked as Enablers on the ART or team backlogs.
Chapter 7: Culture & Collaboration
Technology and automation are essential for DevOps, but they are ineffective without the right culture. This chapter outlines the cultural norms and collaboration practices that enable our DevOps approach.
7.1 Communication Channels & Practices
Open, transparent, and efficient communication is vital.
- Primary Channels:
- [e.g., Slack/Microsoft Teams] for real-time chat and team channels ([List key channels, e.g., #team-name, #devops-alerts, #devops-cop]).
- [e.g., Jira/Azure DevOps] for backlog management and work item tracking.
- [e.g., Confluence/SharePoint] for documentation and knowledge base.
- Regular SAFe events (Stand-ups, Iteration Planning/Review/Retrospective, PI Planning, System Demo, I&A).
- Practices:
- Be responsive and respectful in communications.
- Prefer open channels over private messages for team-relevant topics.
- Document decisions and key information in shared spaces ([e.g., Confluence, work items]).
- Assume positive intent.
7.2 Shared Ownership
We operate with a "you build it, you run it" mindset, extended to "we build it, we run it."
- Collective Responsibility: Teams share responsibility for their service/application's quality, performance, security, and reliability across the entire lifecycle, including production support.
- Cross-Functional Collaboration: Developers, Testers, Ops, Security, and others work together from the beginning (CE) through operations. Silos are actively discouraged.
- Empowerment: Teams are empowered to make decisions about their services and pipelines within the established standards and architectural guidelines.
7.3 Blameless Post-Mortems
When incidents or failures occur, our focus is on learning and system improvement, not on blaming individuals.
- Purpose: To understand the contributing factors (technical, process, human) that led to an incident and identify actions to prevent recurrence or reduce impact.
- Process:
- Incident occurs and is resolved.
- A post-mortem meeting is scheduled involving relevant parties.
- A facilitator guides the discussion using a standard template.
- Focus on "what happened?" and "how can we improve the system?" rather than "who made a mistake?".
- Actionable improvement items are identified, assigned owners, and tracked.
- Resources:
- [Link to Blameless Post-Mortem Template]
- [Link to repository/page where post-mortem reports are stored]
7.4 Knowledge Sharing
Continuously sharing knowledge and learnings across teams is crucial for collective growth and avoiding repeated mistakes.
- Mechanisms:
- Documentation: Maintaining up-to-date documentation in [Wiki/Knowledge Base Tool, e.g., Confluence, SharePoint], including architectural decisions, runbooks, and process guides.
- Community of Practice (CoP): Our [e.g., DevOps CoP, Testing CoP] provides a forum for sharing best practices, tools, and experiences. [Link to CoP page/channel]
- Cross-Team Demos/Show-and-Tells: Teams share interesting work or learnings during [e.g., System Demos, specific CoP meetings, informal lunch & learns].
- Code Reviews: Used not just for quality but also for sharing knowledge of the codebase and patterns.
- Pair Programming / Mob Programming: Encouraged for complex tasks or knowledge transfer.
Chapter 8: Getting Started & Resources
This chapter provides resources to help team members get started with our DevOps practices and find further information or assistance.
8.1 Onboarding Checklist for New Team Members
Welcome to the team! To get up to speed with our DevOps ways of working, please work through the following steps with your onboarding buddy or manager:
- Read Chapters 1, 2, and 7 of this Playbook to understand our purpose, principles, and culture.
- Review Chapter 5 to understand your role's contribution to the DevOps lifecycle.
- Get access to key systems:
- Version Control: [Link to Git Platform]
- CI/CD System: [Link to CI/CD Tool]
- Work Item Tracking: [Link to Jira/Azure DevOps]
- Documentation: [Link to Confluence/SharePoint]
- Monitoring/Logging: [Link to Monitoring Tool]
- Communication: [Link to Slack/Teams] & join relevant channels ([e.g., #team-channel, #devops-cop])
- Review the specific pipeline details for your team/service (referencing Chapters 3 & 4).
- [Link to team-specific pipeline documentation/diagram, if applicable]
- Set up your local development environment according to [Link to Dev Environment Setup Guide].
- Pair with a team member to walk through the CI/CD pipeline for a recent change.
- Complete mandatory training modules: [Link to Training Portal/List]
- Introduce yourself in the [e.g., #introductions Slack channel].
8.2 Training Resources & Learning Paths
Continuous learning is key to DevOps. We encourage you to utilize the following resources:
- Internal Training:
- [Link to internal DevOps 101 training module]
- [Link to internal Secure Coding training]
- [Link to internal training on specific tools like Jenkins, Terraform, Kubernetes, etc.]
- [Link to recordings of past CoP meetings or tech talks]
- External Resources:
- SAFe DevOps Course: [Link to relevant Scaled Agile course page]
- DORA Metrics: [Link to Google Cloud DORA page or DevOps Research and Assessment site]
- [Link to recommended external blogs, books, or online courses, e.g., Pluralsight paths, Udemy courses]
- Learning Paths: [Optional: Link to defined learning paths for specific roles or skills]
8.3 Links to Key Tools & Documentation
Quick access to frequently used tools and documentation:
- Playbook Source: [Link to where this playbook is stored/managed, e.g., Confluence, Git repo]
- Version Control: [Link to Git Platform]
- CI/CD: [Link to CI/CD Tool]
- Artifact Repository: [Link to Nexus/Artifactory]
- Work Tracking: [Link to Jira/Azure DevOps]
- Documentation: [Link to Confluence/SharePoint]
- Monitoring Dashboard: [Link to primary Grafana/Datadog/etc. dashboard]
- Logging: [Link to ELK/Loki/etc. query interface]
- Secrets Management: [Link to Vault/Key Vault]
- IaC Repositories: [Link to main Terraform/Ansible repos]
- Post-Mortem Reports: [Link to storage location]
- DevOps CoP Page: [Link to CoP Wiki/Channel]
8.4 Who to Contact for Help
If you have questions or need assistance:
- Your Team: Your first point of contact should be your teammates or your Scrum Master/Team Coach.
- DevOps Community of Practice: Ask questions in the [e.g., #devops-cop Slack channel].
- Platform/SRE Team: For issues related to shared infrastructure, CI/CD platform, monitoring tools: Contact via [e.g., #platform-support channel, Jira Service Desk project]
- Security Team: For security-related questions or consultations: Contact via [e.g., #security channel, email alias]
- Playbook Owner: For suggestions or corrections to this playbook: Contact [Owner Team/Role mentioned in Sec 1.5] via [Mechanism mentioned in Sec 1.5].
Chapter 9: Glossary
Definitions of common terms used within this playbook and our SAFe/DevOps context.
- Agile Release Train (ART): A long-lived team of Agile teams, which, along with other stakeholders, incrementally develops, delivers, and where applicable operates one or more Solutions in a value stream.
- Automation: The use of technology to perform tasks or processes with minimal human intervention, crucial for CI/CD pipelines.
- Blue-Green Deployment: A deployment strategy where two identical production environments ("Blue" and "Green") are maintained. New versions are deployed to the inactive environment and traffic is switched over once validated.
- CALMR: SAFe's approach to DevOps, representing Culture, Automation, Lean Flow, Measurement, and Recovery.
- Canary Release: A deployment strategy where a new version is released to a small subset of users/servers first to test it before rolling it out more broadly.
- Change Failure Rate: (DORA Metric) The percentage of deployments to production that result in degraded service and require remediation.
- Continuous Delivery (CD): The practice of automating the delivery of software to selected environments (like testing or production) after passing automated tests. Builds on CI.
- Continuous Deployment (CD): The practice of automatically deploying every change that passes the automated tests through the pipeline to production.
- Continuous Exploration (CE): The process of continually exploring market and user needs, and defining a vision, roadmap, and set of features for a solution.
- Continuous Integration (CI): The practice of frequently merging code changes into a central repository, after which automated builds and tests run.
- Continuous Delivery Pipeline (CDP): Represents the workflows, activities, and automation needed to shepherd a new piece of functionality from ideation to an on-demand release of value to the end user. Includes CE, CI, CD, and RoD.
- Deployment Frequency: (DORA Metric) How often an organization successfully releases to production.
- DevOps: A mindset, culture, and set of technical practices that enables closer collaboration between development, operations, security, and business teams to deliver value faster and more reliably.
- DevSecOps: Integrating security practices within the DevOps process.
- DORA Metrics: Four key metrics identified by the DevOps Research and Assessment (DORA) group as indicators of high-performing teams: Deployment Frequency, Lead Time for Changes, Change Failure Rate, Time to Restore Service.
- Feature Toggle/Flag: A technique allowing code to be deployed but hidden ("toggled off") in production, enabling Release on Demand and A/B testing.
- Infrastructure as Code (IaC): Managing and provisioning infrastructure through machine-readable definition files (code), rather than manual configuration.
- Inspect and Adapt (I&A): A significant event held at the end of each Program Increment (PI), where the current state of the Solution is demonstrated and evaluated. Teams then reflect and identify improvement backlog items via a structured problem-solving workshop.
- Lead Time for Changes: (DORA Metric) The median time it takes for a commit to get into production.
- Mean Time To Restore/Recover (MTTR): (DORA Metric) The median time it takes to restore service after a production failure occurs.
- Program Increment (PI): A timebox (typically 8-12 weeks) during which an Agile Release Train delivers incremental value in the form of working, tested software and systems.
- Release on Demand (RoD): The ability to make value available to customers whenever the business needs it, decoupled from the deployment cadence.
- Site Reliability Engineering (SRE): An engineering discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. Often implements DevOps principles.
- Value Stream: The set of actions that take place to add value to a customer from the initial request through realization of value by the customer.
- [Add any organization-specific terms or acronyms here]
Chapter 10: Appendices (Optional)
This section contains supplementary materials referenced in the playbook or otherwise useful for implementing our DevOps practices.
- Appendix A: Definition of Done (DoD) Checklist
- [Insert or link to your standard Definition of Done, potentially including DevOps-related items like 'monitoring configured', 'deployment automated', 'IaC updated']
- Appendix B: Release Readiness Checklist
- [Insert or link to a checklist used before releasing significant changes, covering functional tests, performance tests, security checks, operational readiness, communication plan, rollback plan]
- Appendix C: Blameless Post-Mortem Template
- [Insert or link to the standard template used for conducting post-mortems]
- Appendix D: Detailed Tool Configuration Guides
- [Optionally include or link to specific guides for setting up or using key tools like Jenkins, Terraform, Grafana, etc., if not covered elsewhere]
- [Add other appendices as needed]
End of Playbook