workqueue couch db replication procedure - dmwm/WMCore GitHub Wiki

Introduction

From time to time we need to cleanup the central couch databases because they accumulate too much deleted docs (>90%), which causes:

a slowdown in the system in general, including replication to the agents workqueue_inbox db
an increase in the views size (central workqueue views are now about 70GB!)
UPDATE: HTTP timeouts - including replication - start hitting us once there is more than 3M deleted documents in the workqueue database.

There are at least two ways to cleanup couch databases, either deleting the whole database (when we do not care about its data, since it can get replicated back), or by temporarily creating another database only with the documents we want (so called permanent docs). For workqueue database in CMSWEB, the latter is the procedure we need to follow.

Since we migrated to CouchServer 1.6.1, the _replicate database is gone and all replication tasks are carried by the replicator db.

It's also worth mentioning that Alan faced several couch weirdness (and sometimes couch/replication crashes), problems like:

permission denied to create documents in the new database (thus requiring user_ctx option, more below)
dozens/hundreds of process trigerring the couch replication, which caused most of the instabilities (can be seen with _active_tasks, which prints TONS of replication tasks instead of 1 only)
replication filter in erlang causes much more replication process to be spawned, increasing the couch server instability.

Step-by-step to replicate workqueue couch db

These instructions are supposed to be performed in one of the developers VM, after copying over the original workqueue.couch database file from the CMSWEB production node. When running these steps on the CMSWEB, there might be minor changes to the commands and an announcement for a short outage.

To start off, we need to temporarily disable all compact and compactviews cron jobs on the backend that will go through this procedure. After updating the crontab, check the couchdb status, if there are any compactions or indexer tasks, stop here and come back later. This is mainly to avoid overloading that backend node with so many couch tasks, in addition to all the standard requests that come through.

Let us start by creating the replication filter function in javascript. This function is very simple, it does NOT replicate documents that are deleted (doc._deleted:true), all the other documents will get replicated. Run the following command in the terminal:

cat > /tmp/filter.js << EOF
function(doc, req) {
    if (doc._deleted){
       return false;
    }
    return true;
}
EOF

Now we need to move this tmp file under the workqueue couchapps directory AND correct/update the ownership and permissions if needed:

mv /tmp/filter.js /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/filterDeletedDocs.js 
sudo chown _sw:_sw /data/srv/current/apps/workqueue/data/couchapps/WorkQueue/filters/filterDeletedDocs.js

Make sure the correct (the one you want to replicate/cleanup) workqueue.couch database is in place and with the right permissions/ownership. File is:

ls -lh /data/srv/state/couchdb/database/workqueue.couch

Before restarting couch server, make sure once again that there are no active tasks running (using the status command). Restart couch server such that it uploads the new filter function (in filterDeletedDocs.js) to the WorkQueue design document:

(A=/data/cfg/admin; cd /data; $A/InstallDev -s stop:couchdb)
(A=/data/cfg/admin; cd /data; $A/InstallDev -s start:couchdb)

OPTIONAL: we might have to fix the permissions of this file to avoid errors in the logs (which has been always there). Not sure CMSWEB needs to do this...:

sudo chmod 770 /data/srv/current/auth/couchdb/hmackey.ini

Then it's time to start the replication by creating a new document in the _replicator database. Run this command:

curl -ks -X POST http://localhost:5984/_replicator -H 'Content-Type:application/json' -d '{"_id":"rep_wq", "source":"workqueue", "target":"newwq", "continuous": true, "create_target":true, "filter":"WorkQueue/filterDeletedDocs", "user_ctx": {"name": "_admin", "roles": ["_admin"]}}'

the description of those options are:

_id: we can define the doc id in couch 1.6.1, which helps to keep track of this document
source: the database to replicate documents from
target: the database to replicate documents to (if they are both local, we can just provide their names)
continuous: set to true to make sure any new documents written to the source will also get replicated to the target
create_target: creates the target database if it does not exist
filter: the filter function which every single source document has to pass before getting replicated
user_ctx: we need to provide a user and its role in order to write documents to the new database (also needed in the CMSWEB environment!)

You get a revision number from the previous command, but you can also check the state of this replication by checking the replication document you posted before.

curl -ks "http://localhost:5984/_replicator/rep_wq"

or checking the couch status command output (there should be a replication task active), couch.log is also your best friend:

(A=/data/cfg/admin; cd /data; $A/InstallDev -s stop:couchdb)

In order to check the status of the replication, one can run the _active_tasks command or get a summary of each database:

curl -ks "http://localhost:5984/_active_tasks" 
curl -ks -X GET 'http://localhost:5984/workqueue'
curl -ks -X GET 'http://localhost:5984/newwq'

once newwq database has the same amount of doc_count as the source database (workqueue) AND it shows "progress":100, replication should be up-to-date and you can now perform the second phase of this intervention.

Second part of the intervention

On the day of swapping the databases in the CMSWEB environment, we need to create a service outage to make sure that no new documents will be written to the old database and not get replicated to the new database. Among the steps required, are:

set access to couch workqueue to nowhere in the frontend map rules, like vocms0731nowhere.cern.ch instead of vocms0731.cern.ch
watch the couch.log until the read/write activities stop (a few periodic GETs should still go in, likely from ServerMonitor)
check again the replication progress (it must be 100%). Also check the number of valid documents in workqueue and newwq.
- IF they are the same, you can stop Couch server.
- ELSE IF workqueue has mores documents than newwq, it means the replication isn't completed
- ELSE IF workqueue has less documents than newwq, then it means some documents were DELETED from the workqueue database while they are still valid in newwq, which is very very likely to happen. Given that we do not replicate deleted docs, those documents need to be "manually" deleted from the newwq with a command like (DOCID and REV needs to be updated):

curl -ks -X DELETE "http://localhost:5984/newwq/db62c4b699594b98db517c653ac063e2?rev=6-36d1c696a60ac9fb874a37c2ee2f6540"

fetch the data volume of the original couch database and its views
AGAIN, databases must have the same doc_count by now.
if the previous step has been well taken care of, it's time to swap the databases. Move the newwq.couch file to workqueue.couch (renaming the old database by the new one)
delete ALL the workqueue view files in /data/srv/state/couchdb/database/.workqueue_design/
at this point you should have the new database in place, and old views completely removed from the system, thus you can restart Couch server
when couch gets restarted, your previous replication document gets removed as well, so at this point there is no more any replication on going.
trigger the views building with the following command:

curl -ks -X GET 'http://localhost:5984/workqueue/_design/WorkQueue/_view/elementsByStatus' | head

restore the frontend maps and that's all!
email whoever saying the intervention is done.

Keep an eye on the couch.log just to make sure it's health. Thanks!

Replication instructions update

The section above is still valid and more detailed, but here is a simple summary of the commands executed in the CouchDB VM backend.

The database had to be manually created (it failed to create with the replication):

curl -X PUT http://localhost:5984/new_wq

Define a localhost replication (note that source/target need to have full url in CouchDB +2.x). If the couch instance is not getting any income requests, we might want to use "continuous": false instead:

curl -ks -X POST http://localhost:5984/_replicator -H 'Content-Type:application/json' -d '{"_id":"rep_wq", "source":"workqueue", "target":"new_wq", "continuous": true, "filter":"WorkQueue/filterDeletedDocs", "checkpoint_interval": 100000}'

Once the replication checkpointed_source_seq has the same value as update_seq in the source database, it means replication is completed.

Now we can delete the replication document (note that you need to use the correct revision number in the command below):

curl -ks -X DELETE "http://localhost:5984/_replicator/rep_wq?rev=BLAH_BLAH"

Latest rotation numbers

Rotation for the workqueue database on 22/Jun/2022, with a database summary as:

{"db_name":"workqueue","doc_count":61306,"doc_del_count":3345978,"update_seq":37812510,"purge_seq":0,"compact_running":false,"disk_size":8065179780,"data_size":2165447904,"instance_start_time":"1655923205137624","disk_format_version":6,"committed_update_seq":37812510}

NOTE that both the database replication and the views indexing were performed while the backend was unavailable (no income load)
where we can see that the disk size was 7.6GB, while actual data size was 2GB.
note that despite having 3.3M deleted documents, update sequence was much higher at almost 38M, thus 38M records to process in the database replication.
views went up from 740MB to ~1.1GB(?) and it took around 30min to build them. Now they are 480MB though
replication from/to localhost in vocms0740 took around 1h50min
database data size went down from 2GB (data-size) to 600MB

Rotation for the workqueue database on 22/Oct/2018, numbers were:

database was taking 1.4GB, with 1.5k valid docs and 930k deleted documents
views were taking 70GB, with 3 or 4 different file views after the rotation happened, the new database numbers were:
time spent to replicate to the new database: < 1h
time spent to rebuild the workqueue views: < 1min
database became 20MB only
views took only 15MB