couchbase indexing and querying performance - sreeragravindran/references GitHub Wiki

Problem Statement

In order to reconcile data with multiple systems, we need to be able to pull the entire list of customerIds ( to start with ) from our couchbase cluster.

select customerId from <bucket> where type='account' and scheme='savings'

This is a heavy workload on the current cluster and running this chokes the query node out of the cluster

Causes

Insufficient Indexes

Index / Query services are running in the same node preventing their independent scalability

  • The index nodes are advised to be run on separate memory intensive machines and query nodes on compute intensive ones, so these can be scaled according to the querying demands as and when necessary.
  • best practices ( https://blog.couchbase.com/indexing-best-practices/ )

Less Resident Ratio

  • the resident ratio is the amount of data residing in memory as compared to disk. In our cluster this is only 10-20%. The issue with this is: when we query any uncovered field ( not available in the index), the data needs to be fetched from data nodes - if the resident ratio is less, there is higher number of disk reads resulting in a performance degradation

Backfill Limit

  • When the query has a very large index scan and is not able to fit the results in the query services's memory, it buffers the results in the configured temp directory for the Query Service. By default, this is just /tmp, with a default quota of 5120MB.

Actions taken to resolve

New Cluster

We have created a new cluster in Frankfurt region with the following configuration:

6 data nodes  - r5.4xlarge types 
2 index nodes - r5.4xlarge types 
2 query nodes - c5.4xlarge types

The Right Indexes

A partial index for accounts in Savings scheme and ensuring customerId is covered in the index

CREATE INDEX savings_account_customers_index ON <bucket> (type,schemeId,customerId) 
WHERE (schemeId = "savings" AND type="account") WITH { "num_replica":1, "defer_build":true }` 

Similarly one for Current customers

CREATE INDEX current_account_customers_index ON <bucket> (type,schemeId,customerId) 
WHERE (schemeId = "current" AND type="account") WITH { "num_replica":1, "defer_build":true }` 

BUILD INDEX ON `<bucket>`(`savings_account_customers_index`,`current_account_customers_index` ) USING GSI;

https://blog.couchbase.com/deferring-of-index-creation/

Backfill Limit

The backfill limit is changed to -1 ( unlimited ) from the admin console. https://docs.couchbase.com/server/5.5/settings/backfill.html

Index Scan Timeout

Any query using an index scan is defaulted to timeout in 2 mins. This and the other indexer settings can be seen using

curl -X GET -u Administrator:<password> http://<index-node-ip>:9102/settings 

The index scan timeout has been changed to 30 mins using

curl -X POST -u Administrator:<password> http://<index-node-ip>:9102/settings --data '{"indexer.settings.scan_timeout": <timeout-in-milliseconds>}'

https://forums.couchbase.com/t/index-scan-timed-out/9874/3

Data Extraction

We are now able to extract the customerIds in under 9 minutes using the below curl (running it within AWS ):

curl http://<query-node-ip>:8093/query/service -u Administrator:<password> -d 'statement=<your query statement>

Fetching it from outside AWS takes quite a long time to stream the data out and will timeout at the set index-scan-timeout value. As the objective is to extract data into s3, it doesn't make sense to query from outside AWS ( on-premise spark cluster as planned earlier).

⚠️ **GitHub.com Fallback** ⚠️