2.2 Introduction to Metrics - duttdev489/Datadog GitHub Wiki

Metrics

Metrics are the smallest unit in the Datadog universe, but they provide enormous insight into infrastructure when they are visualized, measured, and monitored.
Metrics are numerical measurements about any aspect of your system over a period of time, such as latency, error rates, and user registrations.
In Datadog, metric data is received and retained as data points that include a value and timestamp.

Monitors send notifications when metrics fall outside of the tolerances that you define.

Service Level Objectives track metrics over long periods of time to help you define quality standards.

Sending metrics to Datadog

Metrics can be sent to Datadog from several sources:

The Datadog Agent automatically sends many standard metrics, such as avg:system.cpu.user: 18.63 or system.disk.read_time: 42.
Datadog integrations include metrics out of the box.
Metrics can be generated within the Datadog platform.
You can create custom metrics related to your business and submit them through the Agent, DogStatsD, or the HTTP API.
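As an illustration of the DogStatsD route, the sketch below builds a datagram in the DogStatsD text format (`<name>:<value>|<type>|#<tags>`) and sends it over UDP. The metric name `storefront.user.registrations` and the tag `env:dev` are hypothetical, and the example assumes an Agent listening on the default DogStatsD port 8125 on the local host.

```python
import socket


def format_datagram(name, value, metric_type, tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP (assumes a local Agent on port 8125)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical business metric: a counter ("c") incremented per registration.
registration = format_datagram(
    "storefront.user.registrations", 1, "c", tags=["env:dev"]
)
```

Calling `send_metric(registration)` would submit the counter. UDP is fire-and-forget, so the call does not confirm that an Agent received it.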

Metric Types

Datadog offers metric types that apply to specific purposes: count, rate, gauge, histogram, and distribution. These metric types determine the available graphs and functions that can be utilized with the metric within Datadog.

Count

Adds up the values received within a specified time interval. For example, 2000 HTTP requests.

Rate

Divides the count by the duration of the time interval.
Using the same example mentioned above, 0.566 HTTP requests per second.
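The relationship between count and rate can be sketched in a few lines. The submitted values and the one-hour interval below are assumptions chosen for illustration; the text above does not state which interval lies behind its figures.

```python
def count_and_rate(values, interval_seconds):
    """Return (count, rate) for the values received in one interval."""
    total = sum(values)                # count: add up everything received
    rate = total / interval_seconds    # rate: count divided by interval length
    return total, rate


# Hypothetical submissions that add up to 2000 requests in one hour.
count, rate = count_and_rate([500, 700, 800], interval_seconds=3600)
```

With these assumed inputs, `count` is 2000 and `rate` is roughly 0.556 requests per second.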

Gauge

Reports the last value received during the specified time interval.
This metric type would be appropriate for monitoring the usage of RAM or CPU, since the last value gives an accurate representation of the host’s behavior during the timeframe: 2097152 bytes of RAM.
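A minimal sketch of gauge behavior, assuming samples arrive as hypothetical `(timestamp, value)` pairs within one interval:

```python
def gauge(samples):
    """Return the value of the most recent sample, or None if there are none."""
    if not samples:
        return None
    # The sample with the largest timestamp is the last one received.
    latest = max(samples, key=lambda sample: sample[0])
    return latest[1]


# Hypothetical RAM readings in bytes, keyed by seconds into the interval.
ram_samples = [(0, 1048576), (5, 3145728), (9, 2097152)]
reported = gauge(ram_samples)
```

Earlier readings in the interval are discarded; only the reading at second 9 is reported.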

Histogram

Summarizes the submitted values into five different values: the mean, count, median, 95th percentile, and maximum.
This generates five distinct timeseries.
For example, this metric type is useful for measuring latency, where knowing only the average is not enough.
Histograms enable you to understand how the data is distributed without recording every single data point.
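The five summary values can be computed as follows. This is a sketch, not the Agent's implementation: the nearest-rank percentile method and the dictionary keys are assumptions, and the latency samples are made up.

```python
import math
import statistics


def summarize(values):
    """Reduce one interval's values to five summary numbers."""
    if not values:
        raise ValueError("no values submitted in this interval")
    ordered = sorted(values)
    # Nearest-rank 95th percentile (an assumed method, for illustration).
    rank = math.ceil(0.95 * len(ordered))
    return {
        "avg": statistics.mean(ordered),
        "count": len(ordered),
        "median": statistics.median(ordered),
        "95percentile": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 250, 12, 16, 13, 14]
summary = summarize(latencies_ms)
```

Here the average (37 ms) is pulled up by a single slow request while the median (13.5 ms) is not, which is why the average alone can mislead.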

Distribution

Summarizes the values submitted within a time interval across all the hosts in your environment.
Distributions provide enhanced query functionality and configuration options that aren’t offered with other metric types.
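To see why summarizing across all hosts matters, the sketch below compares per-host medians with a median taken over the merged values. The host names and values are hypothetical, and working on raw values with a nearest-rank percentile is an assumption made for illustration; how Datadog aggregates distributions internally is not covered here.

```python
import math


def nearest_rank_percentile(values, fraction):
    """Nearest-rank percentile of a non-empty list (fraction in (0, 1])."""
    ordered = sorted(values)
    return ordered[math.ceil(fraction * len(ordered)) - 1]


# Hypothetical latencies reported by two hosts during the same interval.
per_host = {
    "web-1": [10, 12, 11, 13],
    "web-2": [200, 220, 210, 230],
}

# Per-host view: each host is summarized on its own.
host_medians = {
    host: nearest_rank_percentile(values, 0.5)
    for host, values in per_host.items()
}

# Distribution-style view: merge every host's values, then summarize once.
merged = [value for values in per_host.values() for value in values]
global_median = nearest_rank_percentile(merged, 0.5)
```

Averaging the per-host medians would give 110.5, while the median over all values is 13, so the two approaches can answer quite different questions.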

Monitors

Monitors continuously check metrics, integration availability, network endpoints, and other vital aspects of your systems against conditions that you define.
When a threshold is exceeded, Datadog notifies you or your team on the Datadog mobile app, by email, or on your chat platform.
Managing alerts this way gives you visibility into your whole infrastructure from one centralized location.

Datadog offers a variety of monitors to cater to different monitoring needs. Some of the most popular types of monitors include:

Metric Monitors

Used to track and alert on specific metrics.
This type of monitor is commonly used to track things like server CPU usage, memory utilization, or network bandwidth.
For example, send a notification to the DevOps team if the average HTTP response time exceeds 1.50 seconds.
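The threshold logic behind such a monitor can be sketched as below. The function name, the status strings, and the sample response times are hypothetical; a real monitor is configured in Datadog rather than written by hand.

```python
def evaluate_monitor(values, threshold):
    """Compare the average of the evaluated values against a threshold."""
    average = sum(values) / len(values)
    status = "ALERT" if average > threshold else "OK"
    return status, average


# Hypothetical HTTP response times (seconds) in the evaluation window.
status, average = evaluate_monitor([1.2, 1.9, 1.7], threshold=1.50)
```

With these samples the average is about 1.6 seconds, so the monitor would notify the team.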

Service Checks

Used to verify the status and availability of various services, such as databases or APIs.
Service Checks can be used to monitor both external and internal services. For example: is the Redis service healthy?

APM Monitors

Used to monitor application performance, track errors, and measure latency.
This monitor provides insights into how applications are executing and helps to identify potential issues.
For example, alert the development team when an application’s errors per second rise above 25.

Synthetic Monitors

Used to monitor critical user flows and business transactions.
Synthetic API tests and browser tests help you proactively identify issues in application endpoints and key business workflows before your end users encounter them.
Synthetic monitors can simulate user actions, such as logging in or adding an item to a shopping cart, and alert you if there are any errors or delays.

Log Monitors

Used to monitor specific keywords or patterns in log data. Log monitors can track security issues, application errors, or system events.
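The idea of alerting on a pattern in log data can be illustrated with a small sketch. The log lines, the `ERROR` pattern, and the threshold of 1 are all made up for this example.

```python
import re


def count_matches(log_lines, pattern):
    """Count the log lines that contain a match for the given regex."""
    regex = re.compile(pattern)
    return sum(1 for line in log_lines if regex.search(line))


# Hypothetical log lines collected during one evaluation window.
logs = [
    "2024-01-01T00:00:00Z INFO request served",
    "2024-01-01T00:00:01Z ERROR payment declined",
    "2024-01-01T00:00:02Z ERROR timeout contacting db",
]

error_count = count_matches(logs, r"\bERROR\b")
should_alert = error_count > 1  # alert when more than one error is seen
```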

Service Level Objective (SLO)

Service Level Indicators (SLIs) are metrics that are used to measure some aspect of the level of service that is being provided.

For example, 10 errors per second on the storedog-payments service.
SLIs that are vital to your organization’s success can be monitored over time as Service Level Objectives (SLOs).

Tracking these metrics as SLOs establishes clear targets for service quality, enabling you to measure progress and make improvements over time. For example, a service might have an SLO that 99% of requests succeed over the past 7 days, or that latency stays under 1 second 99% of the time over the past 30 days.
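Checking a request-based SLO such as the 99% example comes down to comparing a ratio with a target. The sketch below assumes hypothetical request counts for a 7-day window; the function and field names are illustrative only.

```python
def slo_status(good_events, total_events, target):
    """Compute the SLI as a success ratio and compare it with the target."""
    sli = good_events / total_events
    return {"sli": sli, "met": sli >= target}


# Hypothetical 7-day window: 99,500 successful requests out of 100,000.
week = slo_status(good_events=99_500, total_events=100_000, target=0.99)
```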

This approach enables site reliability engineers, frontend developers, and even product managers to maintain a consistent customer experience, balance feature development with platform stability, and enhance communication with both internal and external users. Your team will be focused on the metrics that matter most and you’ll be consistently delivering a high level of service to your customers. In essence, SLOs provide a roadmap for defining, measuring, and improving service quality, ultimately leading to a more robust and reliable system.