Improving WMAgent deployment

GOAL: Make WMAgent deployment/upgrade more maintainable and effective.

  1. Review the current deployment/upgrade procedure.
  2. Identify the problems with the current scheme.
  3. Create a plan for improvement.

Current WMAgent deployment procedure: There are two different ways of updating the agent in the current production system.

  1. Patching WMAgents: When a bug is found or a new feature needs to be added, we patch the agent and keep a record on the twiki page (How to patch production machine, Patching status of production WMAgents). A sketch of this manual step follows this list.

  2. Redeploying the WMAgents: When there is a storage problem (a WMAgent using 85% of its storage) or an upgrade requires a DB schema change, we completely wipe out the data from the WMAgent (draining procedure) and redeploy a new agent.
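
For context, here is a minimal sketch of what the manual patching step can look like, assuming a site-specific install path and a hypothetical list of pull-request numbers (GitHub serves a unified diff for any PR at https://github.com/dmwm/WMCore/pull/&lt;N&gt;.patch). It is an illustration of the step, not the official procedure:

```python
#!/usr/bin/env python
"""
Sketch of the manual patching step: fetch the diffs of selected PRs and
apply them on top of the installed WMCore code.  The install path and
the PR numbers are assumptions, not production values.
"""
import subprocess
import urllib.request

WMCORE_INSTALL = "/data/srv/wmagent/current"   # assumption: site-specific path
PR_NUMBERS = [11111, 22222]                    # hypothetical patches to apply

for pr in PR_NUMBERS:
    url = "https://github.com/dmwm/WMCore/pull/%d.patch" % pr
    diff = urllib.request.urlopen(url).read()
    # Apply the diff on top of the installed code; -p1 strips the a/ and b/
    # prefixes that git puts in front of each path.
    subprocess.run(["patch", "-p1", "-d", WMCORE_INSTALL],
                   input=diff, check=True)
    print("applied PR %d" % pr)
```

Each agent patched this way has to be tracked by hand on the twiki, which is the bookkeeping problem discussed in the next section.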

Problems with the current scheme

  1. Patching WMAgents: Although the patching procedure is pretty straightforward, it is a manual procedure (we need to select the patches to apply, apply them to individual WMAgents, and maintain the record separately). This can lead to more human errors (records not being updated correctly) and to WMAgents running different versions (one agent with patches 1, 2, 3 and another with only patches 1, 2, etc.).
  2. Redeploying the WMAgent: Even if it is not feasible to completely remove the need for redeploying the agent, the current procedure takes too much time (draining requires human interaction) and the deployment cycle is too short (~3 months, because the data size grows too fast).

Plan for improvement

  1. Patching WMAgents: Instead of patching individual agents, upgrade the agent by swapping the code for a new version. (With this approach, we keep the WMAgents in a more uniform state and can automate the procedure to reduce human interaction and errors.)

    • Create a new image repository with specific tags and the dependencies on external services (DBS client, etc.)
    • Create a new deployment script which only updates the codebase (WMCore code, configuration and other dependencies) without touching the database records in CouchDB/Oracle/MariaDB (a sketch of such a code swap follows this list).
    • Puppetize the procedure using the implementations above.
  2. Redeploying WMAgents: If we cannot completely remove the need for redeploying, we need to do the following:

    • Reduce the deployment time. (Identify the bottlenecks in the draining procedure; remove the manual checking of the WMAgent draining status.)
      • The automatic draining and speed-draining procedures are currently implemented, but iterative work is needed to improve them.
    • Reduce the data size: increase the redeployment period (~6 months).
      • Two places have been identified where the database size can be controlled more effectively.
        • DBSBuffer tables are never cleaned up: although the wmbs database is cleaned up, we do not clean up the DBSBuffer tables. (Need to check whether all the files in a workflow are done updating DBS and PhEDEx and are not used by other workflows; a cleanup-query sketch follows this list.)
        • CouchDB needs to be cleaned up more effectively: although we clean up CouchDB, the data size keeps growing (Couch never deletes the metadata), mostly due to the jobdump databases. (Need to investigate whether we can control the size better; a compaction sketch follows this list.)
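
To make the code-swap idea under point 1 concrete, here is a minimal sketch of a code-only upgrade, assuming a pip-installable WMCore release, a site-specific agent directory and a `manage` wrapper for stopping and starting components; the paths, the release tag and the commands are assumptions, not the final deployment script:

```python
#!/usr/bin/env python
"""
Minimal sketch of a code-only upgrade: swap the WMCore release in place
while leaving the CouchDB/Oracle/MariaDB data untouched.  The install
path, the 'manage' wrapper and the release tag are assumptions about a
typical agent box, not the official deployment tooling.
"""
import subprocess

AGENT_DIR = "/data/srv/wmagent"                        # assumed site-specific path
MANAGE = AGENT_DIR + "/current/config/wmagent/manage"  # assumed wrapper location
NEW_TAG = "2.0.5"                                      # hypothetical WMCore release tag


def run(*cmd):
    """Run a command and fail loudly so the upgrade never half-applies."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Stop the agent components (the databases stay up and untouched).
run(MANAGE, "stop-agent")

# 2. Swap only the code: install the tagged release into the agent's
#    Python environment.  No schema change, no data migration.
run(AGENT_DIR + "/venv/bin/pip", "install", "wmcore==" + NEW_TAG)

# 3. Restart the components against the existing databases.
run(MANAGE, "start-agent")
```

The key property is that none of these steps touches CouchDB, Oracle or MariaDB, so the agent can be upgraded in place without draining.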
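
For the DBSBuffer item, the cleanup hinges on a condition like the one sketched below; the table and column names (dbsbuffer_file.status, in_phedex, dbsbuffer_workflow.completed) and the status value are assumptions about the WMCore schema and must be verified against the actual agent database before anything is deleted:

```python
"""
Illustrative DBSBuffer cleanup condition: remove file records only after
they have been uploaded to DBS, injected into PhEDEx, and are no longer
referenced by an active workflow.  Table/column names and status values
are assumptions, not verified production SQL.
"""
DBSBUFFER_CLEANUP_SQL = """
DELETE FROM dbsbuffer_file
 WHERE status = 'GLOBAL'         -- assumed: file already uploaded to DBS
   AND in_phedex = 1             -- assumed: file already injected into PhEDEx
   AND workflow NOT IN (
         SELECT id FROM dbsbuffer_workflow
          WHERE completed = 0    -- assumed: workflow still running
       )
"""
# NOTE: the "not used by other workflows" check from the list above is
# not covered here and would need an additional file-usage query.
```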
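
For the CouchDB item, part of the space that never gets deleted can be reclaimed with database compaction and view cleanup, which are standard CouchDB endpoints. A sketch, assuming the usual local Couch port and jobdump database names:

```python
#!/usr/bin/env python
"""
Sketch of forcing CouchDB to reclaim space for the jobdump databases.
Compaction rewrites the database file without old document revisions,
and _view_cleanup removes index files of obsolete design documents.
The database names and the localhost:5984 URL are assumptions about a
typical agent setup.
"""
import json
import urllib.request

COUCH_URL = "http://localhost:5984"                                 # assumed Couch location
DATABASES = ["wmagent_jobdump%2Fjobs", "wmagent_jobdump%2Ffwjrs"]   # assumed db names


def post(path):
    """POST an empty JSON body to a CouchDB maintenance endpoint."""
    req = urllib.request.Request(
        COUCH_URL + "/" + path,
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


for db in DATABASES:
    print(db, "_compact:", post(db + "/_compact"))
    print(db, "_view_cleanup:", post(db + "/_view_cleanup"))
```

Deleted documents still leave tombstones behind even after compaction, which is consistent with the observation above that the size keeps growing; that part needs the further investigation mentioned in the list.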