Monitor Cloud Resources - VishalPatangay/My-devops-repo GitHub Wiki

https://docs.microsoft.com/en-us/learn/modules/cmu-monitor-cloud-resources/

Reliability and availability are the product of careful design, and they must be ensured by constant monitoring that alerts administrators when problems occur (or preferably before they occur). Monitoring is as important in mission-critical solutions deployed to the cloud as the solutions themselves. Without it, you don't know whether a solution is serving its users' needs.

Monitoring can trigger actions such as increasing the number of virtual machines to handle increased workloads or notifying an administrator of a condition that warrants attention. Because it's not reasonable to expect human operators to monitor systems 24x7, monitoring is automated through software. That software can come from third parties, or it can come from the cloud platform itself.
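As a minimal sketch of how a monitoring rule might map measurements to the two kinds of action described above (scaling out and notifying an administrator), consider the following. The `Sample` fields, the `decide_actions` function, and both threshold values are hypothetical and chosen only for illustration; they do not come from any particular cloud platform.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One hypothetical measurement taken by the monitor."""
    cpu_percent: float  # average CPU utilization across the VM pool
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def decide_actions(sample, cpu_scale_out=80.0, error_alert=0.05):
    """Return the actions a monitor would trigger for one sample.

    Both thresholds are assumed values, not platform defaults.
    """
    actions = []
    if sample.cpu_percent > cpu_scale_out:
        actions.append("scale_out")      # e.g. add virtual machines
    if sample.error_rate > error_alert:
        actions.append("notify_admin")   # e.g. page the on-call operator
    return actions
```

A real platform would evaluate rules like this continuously and usually require the condition to hold for some time before acting, so that a brief spike does not trigger scaling.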

Learning objectives:

- Explain the main argument in favor of continual and consistent monitoring and oversight in cloud-based IT systems
- Describe the three types of quantitative measurements that play a role in monitoring
- Understand the mechanics of monitoring platforms that utilize agents to collect and report back information
- Understand the mechanics of monitoring platforms that rely upon pre-existing sources of information such as service logs to analyze performance
- Understand the metrics that are the most beneficial to monitoring
- Learn how measurements are used to judge performance levels
- Describe the justification for problem ticketing
- Describe what KPIs are and how they differ from metrics
- Discuss the concept of "everyday remediation"

Monitoring is essential for assessing the status of an application. The instrumentation that supports monitoring of software takes three forms:

  1. Logs: A permanent, immutable record of stored events. An event could be:
     - A change in the status of a component from busy to available
     - The completion of a task
     - An error in the system

A log monitoring system should be capable of providing the following:

Correlation - Joining relevant recorded events together into a single view so that managers are not forced to scan raw data for potentially pertinent information

Normalization - Reducing the volume of recorded data in the database, as well as in administrators' views of the database, into more manageable volumes

Reporting - Presenting a graphical, informational view of events that is intelligible to someone working outside the realm of day-to-day IT operations
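The three capabilities above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (`request_id`, `component`, `event`) and the sample data are hypothetical, normalization is interpreted here as collapsing repeated events into counts, and reporting is reduced to plain-text summary lines rather than a graphical view.

```python
from collections import Counter, defaultdict

# Hypothetical raw log events; field names are assumptions for this sketch.
RAW_EVENTS = [
    {"request_id": "r1", "component": "web", "event": "request_received"},
    {"request_id": "r1", "component": "db", "event": "query_error"},
    {"request_id": "r2", "component": "web", "event": "request_received"},
    {"request_id": "r1", "component": "web", "event": "task_completed"},
    {"request_id": "r2", "component": "db", "event": "query_error"},
]


def correlate(events):
    """Correlation: group events that share a request_id into one view."""
    grouped = defaultdict(list)
    for e in events:
        grouped[e["request_id"]].append(e["event"])
    return dict(grouped)


def normalize(events):
    """Normalization: collapse repeated events into per-type counts."""
    return Counter((e["component"], e["event"]) for e in events)


def report(counts):
    """Reporting: render readable summary lines, most frequent first."""
    return [f"{comp}: {event} x{n}" for (comp, event), n in counts.most_common()]
```

For example, `correlate(RAW_EVENTS)["r1"]` gathers the three events that belong to request `r1`, so an operator can read one request's history without scanning the whole log.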

  2. Metrics: Values that represent the health, stability, and availability of the system.

  3. Traces: The paths of execution through a program.
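To make the idea of a trace concrete, here is a minimal sketch that records the path of execution through a small program as a list of timed spans. The `span` helper, the `spans` list, and the `handle_request` function are all hypothetical names invented for this example; production systems typically rely on a dedicated tracing library instead.

```python
import time
from contextlib import contextmanager

# Completed spans are collected here, in the order they finish.
spans = []


@contextmanager
def span(name, parent=None):
    """Record how long the enclosed block runs and which span called it."""
    start = time.perf_counter()
    try:
        yield name
    finally:
        spans.append({
            "name": name,
            "parent": parent,
            "duration_s": time.perf_counter() - start,
        })


def handle_request():
    """A hypothetical request handler with two traced steps."""
    with span("handle_request") as root:
        with span("query_db", parent=root):
            time.sleep(0.01)   # stand-in for a database call
        with span("render", parent=root):
            time.sleep(0.005)  # stand-in for building the response
```

After calling `handle_request()`, the `spans` list holds one entry per step, each with its parent, so the execution path and the time spent in each step can be reconstructed.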

Question: Users are complaining that a website you deployed to the cloud has suddenly become very slow. Which metric might help you resolve the problem?

Answer: The average time that requests wait in a queue. If a website seems slow, it is often because the web server is receiving more traffic than it can handle. Request wait time is a common metric used to determine whether a server is overloaded.
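The request wait time metric can be computed from two timestamps per request: when it entered the queue and when the server began processing it. The sketch below assumes timestamps in seconds and uses a hypothetical `is_overloaded` check with an arbitrary 0.5 second threshold; a real threshold would depend on the application.

```python
def average_wait_seconds(requests):
    """Average time requests spent queued.

    `requests` is an iterable of (enqueued_at, started_at) pairs in seconds.
    Returns 0.0 when there are no requests.
    """
    waits = [started_at - enqueued_at for enqueued_at, started_at in requests]
    return sum(waits) / len(waits) if waits else 0.0


def is_overloaded(requests, threshold_seconds=0.5):
    """Flag the server when the average wait exceeds an assumed threshold."""
    return average_wait_seconds(requests) > threshold_seconds
```

A rising average wait while request volume also rises suggests the server needs more capacity, which is the situation the question above describes.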

https://docs.microsoft.com/en-us/learn/modules/cmu-monitor-cloud-resources/2-monitoring-platforms