WMCore Developers on shift - dmwm/WMCore GitHub Wiki

This document serves as a short list of the responsibilities that WMCore developers cover during their shift weeks.

The developers in the WMCore team usually share the load of operational responsibilities. A portion of those are recurring, such as following meetings and providing support to other teams, and they tend to take a lot of time, during which it is difficult to pursue a parallel task that requires strong concentration. The shift week is a week during which one developer is dedicated to covering most of the operational activities that follow a regular schedule, so during that week the shifter's time is mostly filled with meetings and debugging. A non-exhaustive list is provided below.

Shift Duration:

A WMCore shift lasts one week: it starts on Monday and runs through Sunday. The person on shift is supposed to communicate with stakeholders and perform any required debugging that comes up between Monday 8am and Friday 5pm (local shifter time). Non-critical issues that arise late on Friday or over the weekend will be dealt with by the upcoming WMCore shifter, starting on Monday. Critical issues that happen during the weekend will be addressed in a best-effort mode.
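The shift window described above can be sketched as a small check. This is only an illustration, not part of WMCore: the function name `in_active_shift_window` is hypothetical, and it assumes the window is read literally as one continuous span from Monday 08:00 to Friday 17:00 in the shifter's local time, rather than as business hours on each day.

```python
from datetime import datetime, time

# Boundaries taken from the text: Monday 8am to Friday 5pm, local shifter time.
SHIFT_START = time(8, 0)
SHIFT_END = time(17, 0)


def in_active_shift_window(moment: datetime) -> bool:
    """Return True if `moment` (local shifter time) falls inside the
    Monday 08:00 - Friday 17:00 window (hypothetical helper)."""
    weekday = moment.weekday()  # Monday == 0 ... Sunday == 6
    if weekday > 4:
        # Saturday and Sunday: only critical issues, best effort.
        return False
    if weekday == 0 and moment.time() < SHIFT_START:
        # Early Monday morning, before the shift begins.
        return False
    if weekday == 4 and moment.time() >= SHIFT_END:
        # Late Friday: non-critical issues go to the next shifter.
        return False
    return True
```

Anything for which this check returns False would, per the rules above, either wait for the upcoming shifter or be handled in best-effort mode if critical.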

Shifter's Responsibilities:

Meetings - Besides our own weekly meeting, we need to cover a set of regular meetings with other teams, during which we try to provide useful technical information on the pieces of the WMCore system that each team uses. For some of the meetings we have an agreement with the people leading the meeting to have the WMCore section at the beginning, but it is useful to stay to the very end even though not active, because we are often asked questions that pop up while discussions are ongoing. During those meetings we also keep the other teams up to date on our schedule of regular deployments and updates, as well as on major changes or important bug fixes that concern them.
Producing internal reports - The WMCore developer on shift serves as the contact between the outside world and the rest of the team. Shortly after every meeting (we keep that interval short, while the information is still fresh), (s)he provides a list of the topics discussed, together with the replies (s)he could or could not give and the outcomes, if a firm decision has been taken. In some cases these result in action items for us, so we need to be sure each of us is on track. If a GH issue needs to be created to follow up on such an action, most of the time we ask the person who brought up the topic to create the GH issue according to the templates we have provided, and we follow through there.
Support - ideally, the person on call is expected to reply to any inquiry within the same day (if it arrives during business hours).
- During those weeks many teams ask questions, through the various channels of communication we follow, about internals of the system on which only we can provide information. Many of them concern not only different APIs and system behavior but also policies discussed far back in time and long forgotten.
- We often need to provide support in debugging issues (especially with the P&R Team) that exceed the level of knowledge about the system, not only of the people using it and asking the question, but also our own.
System monitoring - We need to constantly monitor the health of the system - 24/7. We need to be sure that:
- we provide uninterrupted usage for everybody who depends on the WMCore system
- we do not have components down, resulting in stuck load and overfilling of the system in a short amount of time
- we provide the service bug free, and above all that the way the whole system works does not result in data loss or corruption, e.g. because of continuous misbehavior of a component or an overlooked bug - this is in general a difficult task, not only during shift weeks.
Debugging: The debugging we normally do is usually triggered by, or is a follow-up on, one of the following three categories:
- on demand:
  - Most of the time these are requests from other teams, such as P&R, who watch the general system load and report misbehavior that is noticeable in the payload, i.e. in the workflows' behavior.
- on system failure
  - These are quite visible cases in which a component breaks badly (either completely or in a cyclic pattern) and causes a large backlog to accumulate in some layer of the system. NOTE: The congested part of the system is not necessarily directly linked to the broken component; sometimes the backlog accumulates a few stages after the misbehaving piece.
- on bug discovery - not always leading to an immediate system failure
NOTE: The established practice is to create a follow-up GH issue right after one of the above three cases is met, and to communicate this issue to the rest of the team. Usually the person on shift who starts the debugging takes the issue, but this is not mandatory. Often someone else has more knowledge about the problem at hand, or an emergency debugging may need to span beyond a single shift period, so that another person needs to take over. This is to be communicated internally.

Examples of typical debugging issues:

A good place to look:

Here is a wiki we started long ago for collecting well-known misbehavior cases and possible actions to mitigate their effects (it still needs to be updated on a regular basis, though): https://github.com/dmwm/WMCore/wiki/trouble-shooting

Extra responsibilities

Possible responsibilities that were agreed upon in the past, but which could not be fitted in fairly because of the strong misalignment between the schedule of deployment cycles and the rotation of shift weeks:

Release validation - we decided to track that in GitHub issues and assign them by mutual agreement
Monitoring and support for the CMSWEB Team during regular central services deployments - this more or less still holds as a responsibility of the person on shift, even though sometimes one of us needs to follow a few consecutive cycles.
WMAgent deployment campaigns - currently mostly driven by Alan, for many reasons, but we can cover for him at any time if needed. Draining and monitoring are still a shared responsibility.

Developer's Responsibilities:

The broader responsibilities of every developer in the WMCore team are listed in the following wiki: https://github.com/dmwm/WMCore/wiki/WMCore-developer-responsibilities

Channels to follow:

Slack channels: (DEPRECATED)
- P&R (cms-compops-pnr.slack.com) - actively watching the #wmcore-support channel
- WMCore (cms-dmwm.slack.com) - actively watching all channels. Special attention to:
  - #tier0-dev: to communicate with the T0 team
  - #wmcore-rucio: to communicate with a very small set of the DM experts
  - #wmagent-dev: our internal communication (it should be followed even when you are not on shift).
- Rucio (rucio.slack.com) - passively (only when tagged) watching #cms, #cms-ops and #cms-consistency
Mattermost channels under the CMS O&C organization:
- DMWM: the WM team is expected to follow this dmwm channel on a daily basis, regardless of being on shift duty or not.
- WM Dev: the WM team is expected to follow this wm_dev channel on a daily basis to stay up-to-date with developments involving the WM system.
- WM Ops: the WM developer on shift is expected to be the first line of contact through this wm_ops channel. It is advised to monitor it at least twice a day. Nonetheless, it is recommended that the rest of the WM team follow it as well, to stay on top of potential operational issues.
- WM Team: the wm_team channel is private and dedicated only to the core WM developers. Please also use this channel for sharing meeting summaries with the rest of the team. The WM team is expected to follow it on a daily basis as well.
- Everything else that may concern us in the O&C group (e.g. SI..) - people usually tag us explicitly if we are needed somewhere
Email groups:
- "cms-oc-dmwm (CMS DMWM team)" <cms-oc-dmwm@cern.ch>
- "cms-tier0-operations (CMS tier0 operations)" <cms-tier0-operations@cern.ch>
- "cms-comp-ops-workflow-team (cms-comp-ops-workflow-team)" <cms-comp-ops-workflow-team@cern.ch>

Meetings to follow:

Monday:
- WMCore - 16:00 CERN Time: indico
- CompOps - 17:00 CERN Time: indico
Tuesday:
- T0 - 14:00 CERN Time (first ~15min only): twiki page
Wednesday:
- O&C - 15:00 CERN Time: indico
- P&R - 16:00 CERN Time (first ~15min only): google doc
Friday:
- P&R development - 16:00 CERN Time: zoom, agenda
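Since the shifter works in local time while the meetings are listed in CERN time, a conversion like the one below may help when the two regions switch daylight saving time on different dates. This is a minimal sketch, not an official tool: the helper name `meeting_in_local_time` is hypothetical, and it assumes that "CERN Time" corresponds to the `Europe/Zurich` IANA zone and that the system provides a time zone database.

```python
from datetime import date, datetime, time
from zoneinfo import ZoneInfo

# Assumption: "CERN Time" in the list above means the Europe/Zurich zone.
CERN_TZ = ZoneInfo("Europe/Zurich")


def meeting_in_local_time(day: date, cern_time: time, local_zone: str) -> datetime:
    """Convert a meeting start given in CERN time on `day` to `local_zone`
    (an IANA zone name such as "America/Chicago"). Hypothetical helper."""
    cern_start = datetime.combine(day, cern_time, tzinfo=CERN_TZ)
    return cern_start.astimezone(ZoneInfo(local_zone))


# Example: the Monday WMCore meeting at 16:00 CERN time, seen from Chicago.
winter = meeting_in_local_time(date(2024, 1, 15), time(16, 0), "America/Chicago")
# Mid-March: the US has already switched to daylight saving time, Europe has not.
spring = meeting_in_local_time(date(2024, 3, 18), time(16, 0), "America/Chicago")
print(winter.strftime("%H:%M"), spring.strftime("%H:%M"))
```

Using named zones rather than a fixed hour offset keeps the conversion correct during the few weeks each year when the offset between the two sites differs from the usual one.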

Monitoring we use:

WMAgent dashboard: https://monit-grafana.cern.ch/d/lhVKAhNik/cms-wmagent-monitoring?orgId=11
Job dashboards:
- CMS Job monitoring 12min: https://monit-grafana.cern.ch/d/o3dI49GMz/cms-job-monitoring-12m?orgId=11
- CMS Job monitoring 12min bars: https://monit-grafana.cern.ch/d/chVH8ZoGk/cms-job-monitoring-12m-bars?orgId=11
- CMS Job Monitoring ES agg data: https://monit-grafana.cern.ch/d/000000628/cms-job-monitoring-es-agg-data-official?orgId=11&refresh=15m
- CMS Job Monitoring ES agg data bars: https://monit-grafana.cern.ch/d/Y08Xu0oGz/cms-job-monitoring-es-agg-data-official-bars?orgId=11&refresh=15m
WMStats: https://cmsweb.cern.ch/wmstats/index.html
The list of all currently active agents is checked and maintained on the following GH project board: https://github.com/dmwm/WMCore/projects/5
The latest WMAgent versions/releases can be checked on the following project board: https://github.com/dmwm/WMCore/projects/29

Access rights and credentials:

In order to fulfill one's duties during the shift, the developer must have access to both CERN and FNAL agents. These steps have already been mentioned in the onboarding document here. To elaborate a little on both types of agents we work with:

Access to FNAL agents:
- First, you need to have access to the FNAL computing resources, for which you need to send the proper request form as explained at Fermilab's site here.
- Second, you will need to contact the operators managing the FNAL schedds so that your username is given access to the proper set of machines and is added to the proper groups and service accounts - meaning cmsdataops. The change may take effect only once the regular FNAL puppet run has completed a cycle.
Access to CERN agents:
- One needs a regular CERN account for that, and needs to contact the VOC to be given the same access as for FNAL, with a slight difference - the service account should be cmst1
- To access the cmst1 user without needing a password, use sudo instead of su; this way the individual Kerberos credentials are forwarded with the login session. For convenience, the following alias may be set in everybody's .bashrc file:

alias cmst1='sudo -u cmst1 /bin/bash --init-file ~cmst1/.bashrc'

The full list of machines to get access to is listed here.
CRIC roles one needs - ReqMgr/Data-manager