Patching production machine - dmwm/WMCore GitHub Wiki

Sometimes we need to patch a production WMAgent directly, since it is not always possible to redeploy the production machine. In that case, we SHOULDN'T edit the source code on the production agent by hand.

The procedure goes like this.

  1. Identify the problem and decide whether you need to create a patch. The patch must be created against the production version of the agent.

    i.e. if the production version is 0.9.7, create the patch against 0.9.7, not the master branch.

  2. Thoroughly test the patch and apply it to production.

$ curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d /data/srv/wmagent/current/apps/wmagentpy3/lib/python*/site-packages/ -p3

Tier0Agent code location: replace the bracketed values with the current versions of Tier0 and WMAgent, respectively, in the order shown below.

/data/tier0/srv/wmagent/[2.1.2]/sw/slc7_amd64_gcc630/cms/wmagentpy3/[1.1.12.patch2]/lib/python*/site-packages/

2.1 Adding a couch patch.

$ curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d /data/srv/wmagent/current/apps/wmagent/data -p2

2.2 Push the couchapp to the right application (execute-reqmgr, execute-workqueue, execute-wmagent).

$ $manage execute-agent wmagent-couchapp-init

  3. Restart the components affected by the patch and monitor them.

    $ $manage execute-agent wmcoreD --restart --component AnalyticsDataCollector

  4. Update the patch list in the twiki.
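
The `-p3` and `-p2` options in step 2 tell `patch` how many leading path components to strip from the file names inside the patch, so that repository paths such as `a/src/python/...` line up with the installed tree. A minimal sketch of that stripping logic, together with a wrapper that only applies a patch after a successful dry run, might look like the following. The function names `strip_components` and `apply_checked` and the example file names are hypothetical; `--dry-run` is an option of GNU patch.

```shell
#!/usr/bin/env bash

# Emulate what `patch -pN` does to a file name found in the patch:
# drop the first N slash-separated components.
strip_components() {
    local path=$1 n=$2
    while [ "$n" -gt 0 ]; do
        path=${path#*/}
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

# Apply a patch only if a dry run succeeds first (GNU patch --dry-run).
# Usage: apply_checked PATCH_FILE TARGET_DIR STRIP_LEVEL
apply_checked() {
    local patch_file=$1 target_dir=$2 strip=$3
    if patch --dry-run -d "$target_dir" -p"$strip" < "$patch_file" >/dev/null; then
        patch -d "$target_dir" -p"$strip" < "$patch_file"
    else
        echo "patch does not apply cleanly in $target_dir; nothing changed" >&2
        return 1
    fi
}

# Hypothetical file names, as they would appear in a GitHub .patch file:
strip_components "a/src/python/WMCore/Example.py" 3        # prints WMCore/Example.py
strip_components "a/src/couchapps/Example/views/map.js" 2  # prints couchapps/Example/views/map.js
```

To use the wrapper, the patch would first be saved to a file (for example with `curl -o`) instead of being piped straight into `patch`, and then passed to `apply_checked` with the same target directory and strip level as in step 2.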

Patching CERN production agents in bulk

The same rules as above apply here. If you have the cmst1 user password, you can ssh to lxplus and run the long command below. Before running it, make sure to:

  • update the list/regex of host names (possibly including the relval node)
  • update the pull request number (replace PR_NUMBER by the correct number)
  • update the --shutdown and --restart commands to properly reflect the component that needs to be restarted

Then the skeleton command is as follows (again, from lxplus as cmst1):

for h in vocms0{250,251,252,253,254,255,256,257}; do echo ""; ssh cmst1@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n   ********** Patching `hostname` ************";
curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
$manage execute-agent wmcoreD --shutdown --components=DBS3Upload,JobCreator;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components=DBS3Upload,JobCreator'; done

Check the stdout and make sure the patch was properly applied and the components were restarted.
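
The three items to update before each run can be kept in variables, so that the long command itself never needs editing. The following is only a sketch under that assumption: `build_remote_cmd`, `patch_hosts` and the `DRY_RUN` switch are hypothetical names, not part of WMCore, and `[PR_NUMBER]` remains a placeholder to be replaced.

```shell
#!/usr/bin/env bash

# Build the command string that runs on each agent. Single-quoted pieces and
# the escaped \$manage are expanded on the remote host, not locally.
build_remote_cmd() {
    local pr=$1 components=$2
    printf '%s\n' \
        'source /data/admin/wmagent/env.sh;' \
        'echo -e "\n\n   ********** Patching $(hostname) ************";' \
        "curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/${pr}.patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;" \
        "\$manage execute-agent wmcoreD --shutdown --components=${components};" \
        'echo -e "\nSleeping 3 seconds ..." && sleep 3;' \
        "\$manage execute-agent wmcoreD --restart --components=${components}"
}

# Run (or, by default, only print) the command for every host given.
# Usage: patch_hosts PR_NUMBER COMPONENTS HOST...
patch_hosts() {
    local pr=$1 components=$2 h
    shift 2
    for h in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "[dry-run] ssh cmst1@$h"
            build_remote_cmd "$pr" "$components"
        else
            ssh "cmst1@$h" "$(build_remote_cmd "$pr" "$components")"
        fi
    done
}

# Dry run (the default): print what would be executed, without opening any ssh connection.
patch_hosts "[PR_NUMBER]" "DBS3Upload,JobCreator" vocms0250 vocms0251
```

Once the printed commands look right, the same call prefixed with `DRY_RUN=0` would execute them over ssh.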

Patching FNAL production agents in bulk

The same rules as above apply here. As cmsdataops, ssh to one of the FNAL agents (e.g. submit1) and run the long command below. Before running it, make sure to:

  • update the list/regex of host names (possibly including the relval node)
  • update the pull request number (replace PR_NUMBER by the correct number)
  • update the --shutdown and --restart commands to properly reflect the component that needs to be restarted

Then the skeleton command is as follows (this time from one of the FNAL schedd nodes):

for h in cmsgwms-submit{3,4,5,6}; do echo ""; ssh cmsdataops@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n   ********** Patching `hostname` ************";
curl https://patch-diff.githubusercontent.com/raw/dmwm/WMCore/pull/[PR_NUMBER].patch | patch -d apps/wmagent/lib/python2*/site-packages/ -p 3;
$manage execute-agent wmcoreD --shutdown --components=DBS3Upload,JobCreator;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage execute-agent wmcoreD --restart --components=DBS3Upload,JobCreator'; done

Check the stdout and make sure the patch was properly applied and the components were restarted.
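
Checking the stdout by eye gets tedious with several hosts. As a rough aid, the output of each ssh call could be saved and scanned for the messages GNU patch prints when something goes wrong. The helper name `check_patch_log` is hypothetical, and the list of patterns is an assumption that may not cover every failure mode, so a quick manual read of the output is still advisable.

```shell
#!/usr/bin/env bash

# Read captured patch output on stdin; print a verdict and return non-zero
# if a typical GNU patch failure message is present.
check_patch_log() {
    if grep -Eq "FAILED|can't find file to patch|Reversed \(or previously applied\) patch detected"; then
        echo "FAILED"
        return 1
    fi
    echo "OK"
}

# Example with canned output instead of a real ssh session:
printf 'patching file WMCore/Example.py\n' | check_patch_log        # prints OK
printf 'Hunk #1 FAILED at 10.\n' | check_patch_log || true          # prints FAILED
```

Inside the loop, the ssh output could be captured with something like `ssh ... 2>&1 | tee "patch-$h.log"` and then checked with `check_patch_log < "patch-$h.log"`.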

Copying a new CERN service certificate to all nodes in bulk

For testbed agents, we need to log in to one of the testbed WMAgent nodes (e.g. vocms0192) as the cmst1 user (by typing the password, so that a Kerberos token is created). The commands below assume that the new CERN service certificate has been provided in AFS (usually by Christoph Wissing) and that we want to restart the agent so that it picks up the new certificate. NOTE that the list of vocms nodes needs to be updated.

for h in vocms0{xxx,yyy}; do echo ""; ssh cmst1@$h 'source /data/admin/wmagent/env.sh;
echo -e "\n\n   ********** Updating certificate on `hostname` ************";
$manage stop-agent;
$manage stop-services;
rm -f servicecert.pem servicekey.pem;
cp /afs/cern.ch/user/c/cmscert/transfer/servicecert-vocms0192.pem /data/certs/servicecert.pem;
rm -f /data/certs/servicekey.pem;
cp /afs/cern.ch/user/c/cmscert/transfer/servicekey-vocms0192.pem /data/certs/servicekey.pem;
echo -e "\nSleeping 3 seconds ..." && sleep 3;
$manage start-services;
$manage start-agent'; done
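
Before restarting the services, it may be worth confirming that the copied certificate and key belong together and that the certificate is not about to expire. The sketch below uses standard `openssl` subcommands; the function names `cert_matches_key` and `cert_valid_for` are hypothetical, and it assumes the private key is not passphrase-protected.

```shell
#!/usr/bin/env bash

# Succeed if the public key embedded in the certificate equals the one
# derived from the private key.
cert_matches_key() {
    local cert=$1 key=$2
    [ "$(openssl x509 -noout -pubkey -in "$cert")" = "$(openssl pkey -pubout -in "$key")" ]
}

# Succeed if the certificate is still valid SECONDS from now.
cert_valid_for() {
    local cert=$1 seconds=$2
    openssl x509 -noout -checkend "$seconds" -in "$cert" >/dev/null
}

# Possible use on an agent node, after the cp commands (30 days = 2592000 s):
#   cert_matches_key /data/certs/servicecert.pem /data/certs/servicekey.pem \
#       && cert_valid_for /data/certs/servicecert.pem 2592000 \
#       || echo "certificate check failed on $(hostname)"
```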