couchbase indexing and querying performance - sreeragravindran/references GitHub Wiki
In order to reconcile data with multiple systems, we need to be able to pull the entire list of customerIds ( to start with ) from our couchbase cluster.
select customerId from <bucket> where type='account' and scheme='savings'
This is a heavy workload on the current cluster and running this chokes the query node out of the cluster
- we do not have indexes that cover the required field ( customerId in this case )
- there is only an index on
type
- creating right indexes ( https://blog.couchbase.com/create-right-index-get-right-performance/ )
- The index nodes are advised to be run on separate memory intensive machines and query nodes on compute intensive ones, so these can be scaled according to the querying demands as and when necessary.
- best practices ( https://blog.couchbase.com/indexing-best-practices/ )
- the resident ratio is the amount of data residing in memory as compared to disk. In our cluster this is only 10-20%. The issue with this is:
when we query any uncovered field ( not available in the index), the data needs to be fetched from data nodes - if the resident ratio is less, there is higher number of disk reads resulting in a performance degradation
- When the query has a very large index scan and is not able to fit the results in the query services's memory, it buffers the results in the configured temp directory for the Query Service. By default, this is just /tmp, with a default quota of 5120MB.
The issue diagnosis : https://support.couchbase.com/hc/en-us/requests/26625
We have created a new cluster in Frankfurt region with the following configuration:
6 data nodes - r5.4xlarge types
2 index nodes - r5.4xlarge types
2 query nodes - c5.4xlarge types
A partial index for accounts in Savings scheme and ensuring customerId
is covered in the index
CREATE INDEX savings_account_customers_index ON <bucket> (type,schemeId,customerId)
WHERE (schemeId = "savings" AND type="account") WITH { "num_replica":1, "defer_build":true }`
Similarly one for Current customers
CREATE INDEX current_account_customers_index ON <bucket> (type,schemeId,customerId)
WHERE (schemeId = "current" AND type="account") WITH { "num_replica":1, "defer_build":true }`
BUILD INDEX ON `<bucket>`(`savings_account_customers_index`,`current_account_customers_index` ) USING GSI;
https://blog.couchbase.com/deferring-of-index-creation/
The backfill limit is changed to -1 ( unlimited ) from the admin console. https://docs.couchbase.com/server/5.5/settings/backfill.html
Any query using an index scan is defaulted to timeout in 2 mins. This and the other indexer settings can be seen using
curl -X GET -u Administrator:<password> http://<index-node-ip>:9102/settings
The index scan timeout has been changed to 30 mins using
curl -X POST -u Administrator:<password> http://<index-node-ip>:9102/settings --data '{"indexer.settings.scan_timeout": <timeout-in-milliseconds>}'
https://forums.couchbase.com/t/index-scan-timed-out/9874/3
We are now able to extract the customerIds in under 9 minutes using the below curl (running it within AWS ):
curl http://<query-node-ip>:8093/query/service -u Administrator:<password> -d 'statement=<your query statement>
Fetching it from outside AWS takes quite a long time to stream the data out and will timeout at the set index-scan-timeout
value. As the objective is to extract data into s3, it doesn't make sense to query from outside AWS ( on-premise spark cluster as planned earlier).