Metric Sources - flipkart-incubator/ottoscalr GitHub Wiki

Metrics Required

The following metrics must be present in your PromQL-compliant metrics source for ottoscalr to function:

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate - A Prometheus recording rule. It computes the per-second rate of change (irate) of the total CPU usage (container_cpu_usage_seconds_total) and sums that rate across all series that share the same node, namespace, pod, and container labels.
namespace_workload_pod:kube_pod_owner:relabel - A Prometheus recording rule. It relabels the owner_name label of the kube_pod_owner metric (information about the pod's owner) to workload.
cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits - A Prometheus recording rule. It provides the sum of the CPU resource limits of all containers in each active pod in a namespace.
kube_replicaset_status_ready_replicas - Available with kube-state-metrics. The number of replicas (pods) currently in the Ready state for a given ReplicaSet.
kube_replicaset_owner - Available with kube-state-metrics. Information about the owner of the ReplicaSet.
kube_horizontalpodautoscaler_spec_max_replicas - Available with kube-state-metrics. The maximum number of replicas set on an HPA.
kube_horizontalpodautoscaler_info - Available with kube-state-metrics. Information about a particular HPA.
kube_pod_created - Available with kube-state-metrics. The time when the pod was created.
kube_pod_status_ready_time - Available with kube-state-metrics. The time when the pod became ready.
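
A quick way to confirm that a metrics source exposes all of the series listed above is to compare them against the metric names the source reports. The following is a minimal sketch, assuming the source implements the Prometheus HTTP API endpoint `/api/v1/label/__name__/values`. The helper names `missing_metrics` and `fetch_metric_names` and the example address are illustrative and are not part of ottoscalr.

```python
import json
import urllib.request

# Metric names copied from the list above.
REQUIRED_METRICS = [
    "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate",
    "namespace_workload_pod:kube_pod_owner:relabel",
    "cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits",
    "kube_replicaset_status_ready_replicas",
    "kube_replicaset_owner",
    "kube_horizontalpodautoscaler_spec_max_replicas",
    "kube_horizontalpodautoscaler_info",
    "kube_pod_created",
    "kube_pod_status_ready_time",
]


def missing_metrics(available):
    """Return the required metric names absent from `available`, in list order."""
    present = set(available)
    return [name for name in REQUIRED_METRICS if name not in present]


def fetch_metric_names(base_url):
    """Fetch all metric names from a Prometheus-compatible HTTP API.

    `base_url` is a hypothetical address such as "http://prometheus:9090".
    """
    url = base_url.rstrip("/") + "/api/v1/label/__name__/values"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]
```

Calling `missing_metrics(fetch_metric_names("http://prometheus:9090"))` would return an empty list when every required metric is present, and the names of the absent ones otherwise.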

PromQL Queries

CPU Utilization Data: Fetches the overall CPU usage of a particular deployment in a namespace.

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="<namespace-name>"} * on (namespace,pod) group_left(workload, workload_type)(namespace_workload_pod:kube_pod_owner:relabel{namespace="<namespace-name>", workload="<deployment-name>",workload_type="deployment"})) by(namespace, workload, workload_type)
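
To run the CPU utilization query by hand, the two placeholders have to be filled in and the result sent to the metrics source. The sketch below shows one way to do that, assuming the source exposes the Prometheus `/api/v1/query_range` endpoint with its `query`, `start`, `end`, and `step` parameters. The function names are hypothetical, and the sketch assumes namespace and workload names contain no characters that need escaping inside a PromQL label matcher.

```python
from urllib.parse import urlencode

# The CPU utilization query from above, with the two placeholders turned into
# str.format fields. Literal PromQL braces are doubled to escape them.
CPU_UTILIZATION_TEMPLATE = (
    "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
    '{{namespace="{namespace}"}} * on (namespace,pod) '
    "group_left(workload, workload_type)"
    '(namespace_workload_pod:kube_pod_owner:relabel{{namespace="{namespace}", '
    'workload="{workload}",workload_type="deployment"}})) '
    "by(namespace, workload, workload_type)"
)


def cpu_utilization_query(namespace, workload):
    """Fill the namespace and deployment name into the query template."""
    return CPU_UTILIZATION_TEMPLATE.format(namespace=namespace, workload=workload)


def query_range_params(query, start, end, step):
    """URL-encode the parameters for a GET request to /api/v1/query_range."""
    return urlencode({"query": query, "start": start, "end": end, "step": step})
```

For example, `query_range_params(cpu_utilization_query("prod", "checkout"), 1700000000, 1700003600, "30s")` produces a query string that can be appended to `<base-url>/api/v1/query_range?`, where `prod` and `checkout` are made-up names.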

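CPU Redline Breach Utilization Data: Fetches the CPU utilization datapoints that breach the redline utilization (CPU usage above 0.75 of the CPU limits in the query below) while the workload's ready replica count is still below the maximum replicas set on its HPA.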

(
  sum(
    node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="<namespace-name>"}
    * on(namespace,pod) group_left(workload, workload_type)
    namespace_workload_pod:kube_pod_owner:relabel{namespace="<namespace-name>", workload="<workload-name>", workload_type="deployment"}
  ) by (namespace, workload, workload_type)
  / on (namespace, workload, workload_type) group_left
  sum(
    cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits{namespace="<namespace-name>"}
    * on(namespace,pod) group_left(workload, workload_type)
    namespace_workload_pod:kube_pod_owner:relabel{namespace="<namespace-name>", workload="<workload-name>", workload_type="deployment"}
  ) by (namespace, workload, workload_type)
  > 0.75
)
and on(namespace, workload)
label_replace(
  sum(
    kube_replicaset_status_ready_replicas{namespace="<namespace-name>"}
    * on(replicaset) group_left(namespace, owner_kind, owner_name)
    kube_replicaset_owner{namespace="<namespace-name>", owner_kind="<workload-type(Deployment || Rollout)>", owner_name="<workload-name>"}
  ) by (namespace, owner_kind, owner_name)
  < on(namespace, owner_kind, owner_name)
  (
    kube_horizontalpodautoscaler_spec_max_replicas{namespace="<namespace-name>"}
    * on(namespace, horizontalpodautoscaler) group_left(owner_kind, owner_name)
    label_replace(
      label_replace(
        kube_horizontalpodautoscaler_info{namespace="<namespace-name>", scaletargetref_kind="<workload-type(Deployment || Rollout)>", scaletargetref_name="<workload-name>"},
        "owner_kind", "$1", "scaletargetref_kind", "(.*)"
      ),
      "owner_name", "$1", "scaletargetref_name", "(.*)"
    )
  ),
  "workload", "$1", "owner_name", "(.*)"
)
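
The redline breach query combines two conditions, which can be easier to follow outside PromQL. The sketch below restates them for a single datapoint: the ratio of CPU usage to CPU limits must exceed 0.75, and the ready replica count must be below the HPA's maximum replicas. The function and parameter names are illustrative, and treating a missing or zero limit as "no datapoint" is an assumption of this sketch, not something the query states.

```python
REDLINE = 0.75  # threshold hard-coded in the query above


def is_redline_breach(cpu_usage_cores, cpu_limit_cores, ready_replicas,
                      hpa_max_replicas, redline=REDLINE):
    """Return True if one datapoint satisfies both conditions of the query.

    cpu_usage_cores:  summed CPU usage rate of the workload's pods
    cpu_limit_cores:  summed CPU limits of the workload's pods
    ready_replicas:   ready replicas of the workload
    hpa_max_replicas: max replicas configured on the workload's HPA
    """
    # Sketch assumption: without a positive limit there is nothing to divide
    # by, so the datapoint is treated as absent rather than as a breach.
    if not cpu_limit_cores or cpu_limit_cores <= 0:
        return False
    utilization = cpu_usage_cores / cpu_limit_cores
    # Left side of the `and`: utilization strictly above the redline.
    # Right side of the `and`: the workload has not yet reached HPA max replicas.
    return utilization > redline and ready_replicas < hpa_max_replicas
```

With made-up numbers, a workload using 8 of 10 limit cores with 4 ready replicas and an HPA maximum of 10 counts as a breach, while the same utilization with 10 ready replicas does not.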

ACL Data: Fetches the pod ready latency (the time between a pod being created and becoming ready), which is used to calculate the ACL. The query takes the median of the pod ready latency across all pods of a deployment.

quantile(0.5,(kube_pod_status_ready_time{namespace="<namespace-name>"} - on (namespace,pod) (kube_pod_created{namespace="<namespace-name>"}))  * on (namespace,pod) group_left(workload, workload_type)(namespace_workload_pod:kube_pod_owner:relabel{namespace="<namespace-name>", workload="<deployment-name>", workload_type="deployment"}))
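
The arithmetic behind the ACL query can be sketched in a few lines: subtract each pod's creation time from its ready time, then take the median. The example below assumes the two inputs are plain dictionaries mapping pod names to Unix timestamps, already restricted to the pods of one deployment (the query does that restriction through the join on namespace_workload_pod:kube_pod_owner:relabel). The function names are hypothetical.

```python
from statistics import median


def pod_ready_latencies(created, ready):
    """Per-pod ready latency in seconds: ready time minus creation time.

    Pods missing from either input are skipped, mirroring how the
    `- on (namespace,pod)` match in the query drops unmatched series.
    """
    return {pod: ready[pod] - created[pod] for pod in ready if pod in created}


def median_ready_latency(created, ready):
    """Median ready latency over all pods, or None if there are no pods."""
    latencies = list(pod_ready_latencies(created, ready).values())
    if not latencies:
        return None
    return median(latencies)
```

For an even number of pods, `statistics.median` averages the two middle values, which is the same result that linear interpolation gives at the 0.5 quantile.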