VRO On‐Call Overview - department-of-veterans-affairs/abd-vro GitHub Wiki
Being On-Call
Welcome to your on-call shift! This guide provides an overview and links to essential resources to help you efficiently manage your responsibilities.
On-Call Engineer Core Purpose
- Proactive Monitoring: Serve as a backstop to automated monitoring, proactively addressing anomalies to enhance service quality.
- Team Shield: Protect the development team from disruptions caused by unplanned work, allowing them to maintain focus and productivity.
- Rapid Response: Respond immediately to incidents, manage deployments, and communicate effectively with stakeholders.
On-Call Roles and Responsibilities
For more details, see On-Call Responsibilities.
Primary On-Call Engineer Duties
The on-call engineer's duties are outlined in priority order, particularly within the context of Incident Management:
- Production Issues:
  - Respond immediately to incidents and alerts from PagerDuty and from the #benefits-vro-on-call and #benefits-vro-alerts channels, prioritizing resolution.
  - Regularly check system metrics and verify that deployments succeeded.
  - Expedite hotfixes, root cause analysis (RCA) work, etc.
  - Monitor key communication channels for support and incident-related discussions. See the VRO Communication Channels doc for Slack and Microsoft Teams channel details.
- Blockers:
  - Address any issues that block team productivity, such as problems with QA environments, CI infrastructure, test failures, or deployment failures.
- Unplanned Work:
  - Track support requests that arrive through Slack and other relevant team channels.
- Planned Work:
  - Handle routine production tasks during business hours, including non-urgent alerts and software release approvals.
  - Prioritize immediate response to critical incidents over less time-sensitive tasks.
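The priority order above can be sketched as a simple sort over an on-call queue. This is only an illustration: the category keys, the `WorkItem` type, and the sample items are hypothetical and do not come from any VRO tooling.

```python
from dataclasses import dataclass

# Category order mirrors the duty list above; the key names are illustrative.
PRIORITY = {
    "production_issue": 0,
    "blocker": 1,
    "unplanned_work": 2,
    "planned_work": 3,
}


@dataclass(frozen=True)
class WorkItem:
    title: str
    category: str  # must be one of the PRIORITY keys


def triage(items):
    """Return items ordered by duty priority; ties keep their input order."""
    # sorted() is stable, so items in the same category stay in arrival order.
    return sorted(items, key=lambda item: PRIORITY[item.category])


# Hypothetical queue, listed in arrival order.
queue = [
    WorkItem("Approve software release", "planned_work"),
    WorkItem("CI pipeline failing", "blocker"),
    WorkItem("PagerDuty alert", "production_issue"),
    WorkItem("Support request in Slack", "unplanned_work"),
]

for item in triage(queue):
    print(f"{item.category}: {item.title}")
```

In practice the ordering is a judgment call made by the engineer; the sketch only shows that production issues come before blockers, then unplanned work, then planned work.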
Secondary On-Call Engineer
- Support Role:
  - Assist the primary engineer and take over if they're unavailable.
  - May handle non-urgent tasks and routine production duties, allowing the primary engineer to focus on critical incidents.
Shift Schedule and Handover Procedures
- Availability: On-call engineers should be available during working hours (9 AM to 5 PM ET) and respond to pages promptly, according to their criticality.
- Timing: The on-call rotation aligns with the sprint schedule; each shift runs from the start of a sprint to its end.
- Handover: Document ongoing issues, communicate important updates, and ensure a smooth transition to the next engineer.
Quick Reference
Key Contacts and Escalation Paths
- Internal Contacts: See the Team Contact List for internal leads.
- External Contacts and Issue Escalation: For partner team support, refer to VRO Services, Points of Contact, and Issue Escalation Paths.
Essential On-Call Tools and Resources
Note: Some of the resources below could be consolidated into fewer documents and simplified.
- Monitoring Tools:
  - PagerDuty Incident Dashboard
  - DataDog Dashboards: Links to dashboards and instructions on gaining access
- Incident Resources:
  - Incident Response Guide
  - Incident Reports: Log all SEV 1 and SEV 2 incidents using the Incident Report Slack Workflow and document details in the Incident Reports Wiki.
  - Post-Incident Reviews (Private Repo)
  - Metrics: Track MTTR and other metrics for continuous monitoring improvements. See the Metrics Documentation.
- Regular On-Call Task Resources:
  - On-Call Responsibilities
  - On-Call Runbooks
  - Deployments
  - Dependabot On-Call Responsibility: Instructions for managing Dependabot PRs
  - Recurring On-Call Sprint Work Issue Tracking: Log ongoing issues in recurring GitHub issues (e.g., #3384, #3439, #3499) to maintain visibility and track resolutions.
- SecRel Resources:
- Tools:
- Aqua (VA intranet)
- Snyk: Scan results on the internal repository's Security tab
- BEP Intake Form (VA intranet)
- Benefits Web Services Page (VA intranet)
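As a rough illustration of the MTTR metric listed under Incident Resources, the sketch below averages the time between an incident opening and its resolution. It assumes MTTR here means mean time to resolve; the Metrics Documentation is the authority on the team's actual definition. The function name and the sample timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - opened) over (opened, resolved) pairs.

    Returns None when there are no incidents to average.
    """
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        return None
    # Start the sum from a zero timedelta so the values add correctly.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents: (opened, resolved) timestamps.
incidents = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 10, 30)),   # 90 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 14, 30)),  # 30 minutes
]

mttr = mean_time_to_resolve(incidents)
print(mttr)  # 1:00:00
```

A real report would pull the timestamps from the incident records rather than hard-coding them.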