VRO On-Call Overview

Being On-Call

Welcome to your on-call shift! This guide provides an overview and links to essential resources to help you efficiently manage your responsibilities.

Proactive Monitoring: Serve as a backstop to automated monitoring, proactively addressing anomalies to improve service quality.
Team Shield: Protect the development team from disruptions caused by unplanned work, allowing them to maintain focus and productivity.
Rapid Response: Respond immediately to incidents, manage deployments, and communicate effectively with stakeholders.

For more details, see On-Call Responsibilities.

The on-call engineer's duties are listed below in priority order, which matters most during Incident Management:

Production Issues:
- Respond immediately to incidents and alerts raised through PagerDuty, #benefits-vro-on-call, or #benefits-vro-alerts, and prioritize their resolution.
- Regularly check system metrics and verify the success of deployments.
- Expedite resolution for hotfixes, root cause analysis (RCA) work, etc.
- Monitor key communication channels for support or incident-related discussions. See the VRO Communication Channels doc for details on the Slack and Microsoft Teams channels.
Blockers:
- Address any issues that may block team productivity, such as problems with QA environments, CI infrastructure, test failures, or deployment failures.
Unplanned Work:
- Track requests for additional support that arrive through Slack and other relevant team channels.
Planned Work:
- Handle routine production tasks during business hours, including non-urgent alerts and software release approvals.
- Prioritize immediate response to critical incidents over less time-sensitive tasks.
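
The priority order above can be sketched as a simple triage sort. This is only an illustration: the `Category`, `WorkItem`, and `triage` names and the sample queue are hypothetical and not part of any VRO tooling.

```python
from dataclasses import dataclass
from enum import IntEnum


class Category(IntEnum):
    """Duty categories in the priority order listed above (lower = more urgent)."""
    PRODUCTION_ISSUE = 1
    BLOCKER = 2
    UNPLANNED_WORK = 3
    PLANNED_WORK = 4


@dataclass(frozen=True)
class WorkItem:
    title: str
    category: Category


def triage(items):
    """Return items ordered by category priority; ties keep arrival order."""
    # sorted() is stable, so items in the same category stay in arrival order.
    return sorted(items, key=lambda item: item.category)


# Hypothetical queue, for illustration only.
queue = [
    WorkItem("Approve software release", Category.PLANNED_WORK),
    WorkItem("CI pipeline failing on main", Category.BLOCKER),
    WorkItem("PagerDuty alert: elevated error rate", Category.PRODUCTION_ISSUE),
    WorkItem("Slack request for support", Category.UNPLANNED_WORK),
]

for item in triage(queue):
    print(f"{item.category.name}: {item.title}")
```

The sketch only orders work by category. In practice, severity and judgment within a category still decide what to pick up first.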

Support Role:
- Assist the primary engineer and take over if they're unavailable.
- May handle non-urgent tasks and routine production duties, allowing the primary engineer to focus on critical incidents.

Availability: On-call engineers should be available during working hours (9 AM to 5 PM ET) and respond to pages promptly, according to their criticality.
Timing: The on-call rotation aligns with the sprint schedule, running from the start to the end of each sprint.
Handover: Document ongoing issues, communicate important updates, and ensure a smooth transition to the next engineer.

Internal Contacts: See the Team Contact List for internal leads.
External Contacts and Issue Escalation: For partner team support, refer to VRO Services, Points of Contact, and Issue Escalation Paths.

Note: some of the resources below could be consolidated into single documents and simplified.

Monitoring Tools:
- PagerDuty Incident Dashboard
- DataDog Dashboards: Links to dashboards and instructions on gaining access
Incident Resources:
- Incident Response Guide
- Incident Reports: Log all SEV 1 and SEV 2 incidents using the Incident Report Slack Workflow and document details in the Incident Reports Wiki.
- Post-Incident Reviews (Private Repo)
- Metrics: Track MTTR and other metrics for continuous monitoring improvements. See the Metrics Documentation.
Regular On-Call Task Resources:
- On-Call Responsibilities
- On-Call Runbooks
- Deployments
- Dependabot On-Call Responsibility: Instructions for managing Dependabot PRs
- Recurring On-Call Sprint Work Issue Tracking: Log ongoing issues in recurring GitHub issues (e.g., #3384, #3439, #3499) to maintain visibility and track resolutions.
SecRel Resources:
Tools:
- Aqua (VA intranet)
- Snyk: Scan results on the internal repository's Security tab
- BEP Intake Form (VA intranet)
- Benefits Web Services Page (VA intranet)
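
As an illustration of the MTTR metric mentioned under Incident Resources, here is a minimal sketch that assumes MTTR means mean time to resolve, computed from each incident's opened and resolved timestamps. The function name and the sample data are hypothetical; the Metrics Documentation remains the authority on how the team defines and tracks the metric.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - opened) across incidents; None if the list is empty."""
    if not incidents:
        return None
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical incident timestamps (opened, resolved), for illustration only.
sample = [
    (datetime(2024, 1, 8, 9, 0), datetime(2024, 1, 8, 9, 30)),    # 30 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 30)),  # 90 minutes
]

print(mean_time_to_resolve(sample))  # prints 1:00:00
```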