GCS Monitoring - ghdrako/doc_snipets GitHub Wiki

Monitoring cloud services and analyzing logs

The primary components of a monitoring solution on GCP are Workspaces and the Cloud Monitoring and Cloud Logging services.

Cloud Monitoring

Cloud Monitoring is a collection of tools for collecting and working with measurements from your systems. Each kind of measurement is generically called a metric, and Monitoring offers over 1,500 metric types (covering Google Cloud, AWS, and supported third-party software). Cloud Monitoring is also where you set up alerts that notify you when measurements deviate from what you define as normal and acceptable.
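The idea of alerting when a measurement deviates from an acceptable value can be sketched in plain Python. This is only an illustration of a threshold condition with a duration window, not the Cloud Monitoring API; `Point` and `threshold_breached` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Point:
    """One measurement of a metric: a timestamp in seconds and a value."""
    timestamp: float
    value: float


def threshold_breached(points: Sequence[Point], threshold: float, duration_s: float) -> bool:
    """Return True if values stay above `threshold` for at least `duration_s` seconds.

    Assumes `points` is sorted by timestamp, oldest first.
    """
    breach_start = None
    for p in points:
        if p.value > threshold:
            if breach_start is None:
                breach_start = p.timestamp  # first point of the current breach
            if p.timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # value back to normal, reset the window
    return False


# Hypothetical CPU utilization samples, one per minute.
cpu = [Point(0, 0.50), Point(60, 0.92), Point(120, 0.95), Point(180, 0.97)]
print(threshold_breached(cpu, threshold=0.9, duration_s=120))
```

Requiring the breach to last for a duration, rather than firing on a single point, is a common way to avoid alerts on short spikes.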

Monitoring capabilities can be categorized into four kinds:

  • Black-box monitoring
  • White-box monitoring
  • Gray-box monitoring
  • Logs-based metrics monitoring

Cloud Logging

Cloud Logging is the service that allows you to store, search, analyze, and alert on logging data and events from both Google Cloud and AWS. Cloud Logging also includes access to the partner service BindPlane (https://bluemedora.com/products/bindplane/bindplane-for-stackdriver/), which can be used to collect logs from over 150 common application components, in Google Cloud or elsewhere.

The Logging service encompasses the following four capabilities:

  • Collection - The automatic collection of logs from Google Cloud services.
  • Analysis - Real-time log data analysis with tools such as Logs Explorer, Dataflow, and BigQuery. Archived logs from Cloud Storage can also be analyzed.
  • Export - Export logs to Cloud Storage, or stream them to Cloud Pub/Sub or BigQuery. Logs-based metrics can be exported to the Monitoring service.
  • Retention - Data access logs can be retained for up to 3,650 days (in log buckets with a configurable retention period) and admin activity logs for 400 days. Logs exported to Cloud Storage or BigQuery can have a longer retention period configured in those services.
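The retention figures above reduce to simple date arithmetic. A minimal sketch, assuming retention is counted in whole days from the time the entry was written; the function names and the lower bound of 1 day are assumptions for illustration, not part of any Google Cloud client library.

```python
from datetime import datetime, timedelta, timezone

# Retention limits quoted in the text above, in days.
MAX_BUCKET_RETENTION_DAYS = 3650
ADMIN_LOG_RETENTION_DAYS = 400


def expires_at(entry_time: datetime, retention_days: int) -> datetime:
    """Return the moment an entry written at `entry_time` stops being retained."""
    # Assumption: the configurable range is 1 to 3,650 days.
    if not 1 <= retention_days <= MAX_BUCKET_RETENTION_DAYS:
        raise ValueError("retention_days must be between 1 and 3650")
    return entry_time + timedelta(days=retention_days)


def is_retained(entry_time: datetime, retention_days: int, now: datetime) -> bool:
    """Return True while the entry is still inside its retention period."""
    return now < expires_at(entry_time, retention_days)


written = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(expires_at(written, ADMIN_LOG_RETENTION_DAYS).date())
print(is_retained(written, 30, now=datetime(2024, 3, 1, tzinfo=timezone.utc)))
```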

Cloud Logging handles the following main types of logs:

  • Audit logs: Data access logs, admin activity logs, and essentially anything that answers the question "who did what, where, and when?"
  • Agent logs: Logs collected by the logging agents and common third-party applications.
  • Network logs: Logs related to firewall rules, VPC network traffic, and other networking services.
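When searching in Logs Explorer, these log types are usually selected with a `logName` filter. The sketch below only builds the filter string; it does not call any API. The helper names, the `LOG_IDS` keys, and the `my-project` project ID are hypothetical; the log IDs themselves are the ones Google Cloud uses for audit, VPC flow, and firewall logs.

```python
from typing import Optional
from urllib.parse import quote

# Log IDs for some of the log types listed above.
LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com/activity",
    "data_access": "cloudaudit.googleapis.com/data_access",
    "vpc_flows": "compute.googleapis.com/vpc_flows",
    "firewall": "compute.googleapis.com/firewall",
}


def log_name(project_id: str, log_id: str) -> str:
    """Build a full log name; the "/" inside a log ID is URL-encoded as %2F."""
    return f"projects/{project_id}/logs/{quote(log_id, safe='')}"


def log_filter(project_id: str, kind: str, min_severity: Optional[str] = None) -> str:
    """Build a Logging query filter for one log type, optionally by severity."""
    name = log_name(project_id, LOG_IDS[kind])
    parts = [f'logName="{name}"']
    if min_severity is not None:
        parts.append(f"severity>={min_severity}")
    return " AND ".join(parts)


print(log_filter("my-project", "admin_activity"))
print(log_filter("my-project", "firewall", min_severity="ERROR"))
```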

Ops Agent

The Cloud operations monitoring agent gathers system and application metrics from virtual machine instances and sends them to Monitoring. By default, the agent collects disk, CPU, network, and process metrics.

Uptime checks

Uptime checks validate whether a service is up or down by trying to reach its exposed URL, IP address, or DNS name from at least three different locations. These checks also measure and display the latency associated with the responses.

A public uptime check can issue requests from multiple locations throughout the world to publicly available URLs or Google Cloud resources to see whether the resource responds.

Public uptime checks can determine the availability of the following monitored resources:

For an uptime check to succeed, both of the following conditions must be met:

  • The HTTP status is Success.
  • The data has no required content or the required content is present.
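The two success conditions can be written as a small predicate. This is a sketch under stated assumptions, not the service's implementation: "Success" is taken to mean a 2xx status code, and required content is matched as a plain substring. `response_ok` is a hypothetical helper.

```python
from typing import Optional


def response_ok(status_code: int, body: str, required_content: Optional[str] = None) -> bool:
    """Apply the two success conditions to a single uptime check response."""
    # Condition 1: the HTTP status is Success (assumed here to mean 2xx).
    status_success = 200 <= status_code <= 299
    # Condition 2: no content is required, or the required content is present
    # (assumed here to be a substring match).
    content_ok = required_content is None or required_content in body
    return status_success and content_ok


print(response_ok(200, "status: healthy"))
print(response_ok(200, "status: degraded", required_content="healthy"))
print(response_ok(503, "status: healthy", required_content="healthy"))
```

The first call passes because no content is required; the second fails on content and the third fails on status.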

Alerting

Metrics Explorer

MQL PromQL

OpenCensus

Cloud Monitoring

Metrics

Custom metrics

Monitoring Query Language