1. Monitoring vs Observability - duttdev489/Datadog GitHub Wiki

Monitoring

Collecting data to know what is happening with the system, application, or infrastructure.

Observability

Finding out why it is happening (from the data we have collected through monitoring).

Three Pillars of Observability

Metrics

Numerical data points that can track anything about your environment over time.

Logs

A file containing timestamped information about the usage of the system.

Traces

Used to track the time spent by an application processing a request and the status of this request.

What is a Metric?

Numerical values that can track anything about your environment over time, from latency to error rates to user signups.
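As a minimal sketch of what "numerical values over time" can look like in practice, the snippet below models a metric as a series of timestamped points. The `MetricPoint` type and the metric name `checkout.latency_ms` are hypothetical and chosen only for illustration; they are not part of any Datadog API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class MetricPoint:
    name: str        # hypothetical metric name
    timestamp: int   # Unix time in seconds
    value: float     # the measured number


# Three samples of the same metric, taken one minute apart.
points = [
    MetricPoint("checkout.latency_ms", 1700000000, 120.0),
    MetricPoint("checkout.latency_ms", 1700000060, 135.0),
    MetricPoint("checkout.latency_ms", 1700000120, 150.0),
]

# Because every point is a number with a timestamp, the series can be aggregated.
average_latency = mean(p.value for p in points)
print(average_latency)
```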

Reasons to collect Metrics

1. Baseline for Operation

Metrics can tell us what normal looks like for our applications. Without metrics, we are stuck guessing what's going on.

2. Reactive Responses

Using metrics, we don't have to wait until a customer reports an outage. We can react to issues in our environment before they snowball.

3. Proactive Responses

Why wait for something to go wrong? By looking at metrics, we can get ahead of problems before they happen.
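The idea of a baseline can be sketched in a few lines: summarize past samples, then flag a new value that falls far outside the usual range. This is only one simple approach, assuming a mean and standard deviation are a reasonable description of "normal"; the function names and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, stdev


def baseline(samples):
    """Summarize what 'normal' looks like as (average, spread)."""
    return mean(samples), stdev(samples)


def is_abnormal(value, samples, tolerance=3.0):
    """Return True when value is more than `tolerance` spreads from the average."""
    average, spread = baseline(samples)
    return abs(value - average) > tolerance * spread


# Hypothetical response times (ms) collected during normal operation.
history = [100.0, 102.0, 98.0, 101.0, 99.0]

print(is_abnormal(101.0, history))  # within the usual range
print(is_abnormal(140.0, history))  # far outside the usual range
```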

Why does this matter?

Spend less money
Fewer outages and a reduced ticket count
Happy customers and management

What is Monitoring?

Monitoring is the act of paying attention to the patterns that your metrics are telling you.
It's about analyzing your data and acting on it.

What do we Monitor?

1. Performance

By watching performance, we can see how our architecture and applications are using the resources that are available.

2. Security

Is something going wrong in our environment? Creating monitors around security metrics can stop incidents in their tracks.

3. Usage

What are users doing in our environment? Are they interacting with our products?

Whom do we Alert?

An alert is simply a threshold set on a monitor. When that threshold is breached, a notification is sent to the designated recipient.
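The threshold-and-notify behavior described above can be sketched as follows. This is a plain-Python illustration, not Datadog's monitor API; the `Monitor` class, the metric name `cpu.percent`, and the use of a list as the "recipient" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Monitor:
    name: str
    threshold: float
    notify: Callable[[str], None]  # how the designated recipient is reached

    def check(self, value: float) -> bool:
        """Send a notification and return True when the threshold is breached."""
        if value > self.threshold:
            self.notify(f"[{self.name}] value {value} breached threshold {self.threshold}")
            return True
        return False


# A list stands in for the recipient (email, chat channel, pager, ...).
sent: List[str] = []
monitor = Monitor("cpu.percent", 90.0, sent.append)

monitor.check(75.0)   # below the threshold: no notification
monitor.check(96.5)   # above the threshold: one notification
print(sent)
```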

Alert Fatigue

Alert Fatigue arises from over-alerting. It's important to alert team members only when something actionable needs to be done.
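One common way to cut down on noisy alerts is to require several consecutive breaches before notifying anyone, so that a single short spike does not page the team. The sketch below assumes that approach; the function name and the `consecutive` parameter are hypothetical.

```python
def should_alert(values, threshold, consecutive=3):
    """Return True only if `consecutive` samples in a row exceed the threshold."""
    streak = 0
    for value in values:
        # Extend the streak on a breach, reset it otherwise.
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False


# Isolated spikes do not alert; a sustained breach does.
print(should_alert([95, 50, 96, 40], threshold=90))
print(should_alert([91, 92, 93], threshold=90))
```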

What is a Log?

A log is a computer-generated file that contains information regarding the usage of a system.

This gives you insight into the behavior of the resource.
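To make this concrete, the sketch below uses Python's standard `logging` module to produce one timestamped log line. The logger name `orders` and the message fields are hypothetical; an in-memory buffer is used instead of a file only so the example is self-contained.

```python
import io
import logging

# Write log records to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

logger = logging.getLogger("orders")  # hypothetical application component
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the record out of the root logger's handlers

# Each call records what happened, together with when and how severe it was.
logger.info("order created id=%s user=%s", 1234, "alice")

log_line = buffer.getvalue().strip()
print(log_line)
```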

Why do we collect logs?

Compliance

Standards that the business is held to might dictate which logs you'll need to store and for how long you need to store them.

Insight

Logs can give you insight into application and system performance that metrics by themselves might not be able to provide.

Security/Audit

This is a priority for businesses. Logs are needed to demonstrate that only authorized activities are going on inside a system.

Practical uses for Logs

Troubleshooting
Auditing
Monitoring
Alerting
Personal history

What kind of services generate logs?

Servers
Containers
Cloud Services
Mobile/IoT Devices
Web Browsers
Serverless functions

Almost everything generates logs.

How long do we store logs?

Compliance

Standards that the business is held to might dictate how long your logs need to be stored.

Usefulness

Some logs are more useful than others. It's up to you to decide which logs need to be stored for whatever length of time is useful to you.

Cost

Storage costs money. Depending on the services you're using, you'll want to keep in mind your budget when deciding on storage length.
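The three factors above can be combined into a rough decision rule. The sketch below assumes that compliance sets a hard minimum, usefulness can extend retention beyond it, and budget caps anything above the compliance minimum. This is one possible policy, not a recommendation, and every name in it is hypothetical.

```python
def retention_days(compliance_min_days, useful_days, affordable_days):
    """Pick a retention period from compliance, usefulness, and cost."""
    # Keep logs as long as they are useful, but never less than compliance requires.
    desired = max(compliance_min_days, useful_days)
    # Budget can shorten the period, but not below the compliance minimum.
    return max(compliance_min_days, min(desired, affordable_days))


print(retention_days(compliance_min_days=90, useful_days=30, affordable_days=365))
print(retention_days(compliance_min_days=0, useful_days=30, affordable_days=14))
print(retention_days(compliance_min_days=365, useful_days=30, affordable_days=90))
```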

Using Datadog for logging

Traces

What is a Trace?

A trace is used to track the time spent by the application processing a request along with the execution path taken.

What is a Span?

An individual unit of work that the code is doing.
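The relationship between a trace and its spans can be sketched with a small data structure: a trace is a collection of spans for one request, and each span records a unit of work with its start and end time. The `Span` class and the span names below are hypothetical and do not reflect any tracing library's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str                     # the unit of work
    start_ms: int                 # offset from the start of the request
    end_ms: int
    parent: Optional[str] = None  # None marks the root span

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# One trace: the spans recorded while handling a single request.
trace = [
    Span("GET /checkout", 0, 180),
    Span("auth.verify", 5, 25, parent="GET /checkout"),
    Span("db.query", 30, 150, parent="GET /checkout"),
    Span("render", 150, 178, parent="GET /checkout"),
]

# Print the execution path with the time spent in each step.
for span in trace:
    indent = "  " if span.parent else ""
    print(f"{indent}{span.name}: {span.duration_ms} ms")
```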

Why do we collect traces?

1. Microservices

As businesses migrate away from monolithic architecture, tracing is needed to figure out what all of the microservices are up to.

2. Optimization

Tracing allows you to optimize the performance of your applications by identifying bottlenecks in the calls being made.

3. Troubleshooting

When something goes wrong, we need insight into the actual application code. Traces can assist us in tracking down errors in the code.
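For the optimization case, finding a bottleneck can be as simple as comparing how long each span took within a request. The sketch below assumes the span durations have already been collected into a dictionary; the span names and numbers are made up for illustration.

```python
# Hypothetical span durations (ms) for the calls made during one request.
span_durations = {
    "auth.verify": 20,
    "db.query": 120,
    "render": 28,
}


def find_bottleneck(durations):
    """Return (name, duration, share of total time) for the slowest span."""
    name, duration = max(durations.items(), key=lambda item: item[1])
    share = duration / sum(durations.values())
    return name, duration, share


name, duration, share = find_bottleneck(span_durations)
print(f"{name} took {duration} ms ({share:.0%} of the measured time)")
```

Here the slowest call stands out immediately, which tells you where optimization effort is most likely to pay off.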