GCS Monitoring - ghdrako/doc_snipets GitHub Wiki
Monitoring cloud services and analyzing logs
The primary components of a monitoring solution on GCP are Workspaces and the Cloud Monitoring and Cloud Logging services.
Cloud Monitoring
Cloud Monitoring is a collection of tools for measuring how your services and infrastructure behave. A collection of such measurements is generically called a metric; Monitoring offers over 1,500 metric types, covering Google Cloud, AWS, and supported third-party software. Cloud Monitoring is also where you set up alerts that notify you when measurements deviate from what you define as normal and acceptable.
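The idea of alerting when measurements deviate from what you consider normal can be sketched in a few lines of plain Python. This is only an illustration, not the Cloud Monitoring API: `ThresholdCondition` and `should_alert` are hypothetical names, and the rule (fire when the last N samples all exceed a threshold) is an assumed simplification of how a threshold-based alert behaves.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ThresholdCondition:
    """Hypothetical alert condition, not a Cloud Monitoring type."""
    threshold: float      # value treated as the upper bound of "normal"
    min_consecutive: int  # how many consecutive samples must breach it


def should_alert(samples: Sequence[float], condition: ThresholdCondition) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed the threshold."""
    if condition.min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < condition.min_consecutive:
        return False  # not enough data to decide
    recent = samples[-condition.min_consecutive:]
    return all(value > condition.threshold for value in recent)


# Made-up CPU utilization samples, oldest first.
cpu_utilization = [0.42, 0.55, 0.91, 0.93, 0.95]
condition = ThresholdCondition(threshold=0.90, min_consecutive=3)
print(should_alert(cpu_utilization, condition))  # True
```

Requiring several consecutive breaches rather than one helps avoid notifications for a single short spike.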
Monitoring capabilities can be categorized into four kinds:
- Black-box monitoring
- White-box monitoring
- Gray-box monitoring
- Logs-based metrics monitoring
Cloud Logging
Cloud Logging is the service that allows you to store, search, analyze, and alert on logging data and events from both Google Cloud and AWS. Cloud Logging also includes access to the partner service BindPlane (https://bluemedora.com/products/bindplane/bindplane-for-stackdriver/), which can be used to collect logs from over 150 common application components, in Google Cloud or elsewhere.
The Logging service encompasses the following four capabilities:
- Collection - The automatic collection of logs from Google Cloud services.
- Analysis - Real-time log data analysis with tools such as Logs Explorer, Dataflow, and BigQuery. Archived logs from Cloud Storage can also be analyzed.
- Export - Export logs to Cloud Storage, or stream them to Cloud Pub/Sub or BigQuery. Logs-based metrics can be exported to the Monitoring service.
- Retention - Access logs can be retained for up to 3,650 days (in logs buckets with a configurable retention period) and admin logs for 400 days. Logs exported to Cloud Storage or BigQuery can have a longer retention period configured in those services.
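To make the notion of a logs-based metric more concrete, the sketch below counts log entries per severity, which is roughly what a counter-style logs-based metric does. It runs on in-memory dictionaries rather than the Logging API; `count_by_severity` and the sample entries are hypothetical, and falling back to `DEFAULT` for entries without a severity mirrors the default severity level in Cloud Logging.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_severity(entries: Iterable[Mapping[str, str]]) -> Counter:
    """Count log entries per severity level (entries without one count as DEFAULT)."""
    return Counter(entry.get("severity", "DEFAULT") for entry in entries)


# Made-up log entries for illustration.
sample_entries = [
    {"severity": "INFO", "message": "request served"},
    {"severity": "ERROR", "message": "upstream timeout"},
    {"severity": "ERROR", "message": "upstream timeout"},
    {"message": "no severity set"},
]

counts = count_by_severity(sample_entries)
print(counts["ERROR"])    # 2
print(counts["DEFAULT"])  # 1
```

A count like `counts["ERROR"]` is the kind of value that, once exported to Monitoring, could be charted or used in an alert.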
Cloud Logging handles the following main types of logs:
- Audit logs: Data access logs, admin activity logs, and essentially anything that answers the question "who did what, where, and when?"
- Agent logs: Logs collected by the logging agents and common third-party applications.
- Network logs: Logs related to firewall rules, VPC network traffic, and other networking services.
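As an illustration of how audit logs can be singled out when searching, the sketch below builds a Logging query filter on the `logName` field. The helper `audit_log_filter` and the project ID `my-project` are hypothetical; the log IDs `cloudaudit.googleapis.com%2Factivity` and `cloudaudit.googleapis.com%2Fdata_access` are the names under which admin activity and data access audit logs are written.

```python
# Log IDs of two audit log types (the slash is URL-encoded as %2F).
AUDIT_LOG_IDS = {
    "admin_activity": "cloudaudit.googleapis.com%2Factivity",
    "data_access": "cloudaudit.googleapis.com%2Fdata_access",
}


def audit_log_filter(project_id: str, kind: str) -> str:
    """Build a query filter that selects one audit log type in a project."""
    if kind not in AUDIT_LOG_IDS:
        raise ValueError(f"unknown audit log kind: {kind!r}")
    return f'logName="projects/{project_id}/logs/{AUDIT_LOG_IDS[kind]}"'


# "my-project" is a placeholder project ID.
print(audit_log_filter("my-project", "admin_activity"))
# logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"
```

The resulting string can be pasted into a log query tool such as Logs Explorer.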
Ops Agent
The Cloud operations monitoring agent gathers system and application metrics from virtual machine instances and sends them to Monitoring. By default, the agent collects disk, CPU, network, and process metrics.
Uptime checks
Uptime checks validate whether a service is up or down by trying to reach its exposed URL, IP address, or DNS name from at least three different locations. These checks also measure and display the latency associated with the responses.
A public uptime check can issue requests from multiple locations throughout the world to publicly available URLs or Google Cloud resources to see whether the resource responds.
Public uptime checks can determine the availability of the following monitored resources:
- Uptime check URL
- VM instance
- App Engine application
- Kubernetes service
- Amazon Elastic Compute Cloud (EC2) instance
- Amazon Elastic load balancer
For an uptime check to succeed, both of these conditions must be met:
- The HTTP status is Success.
- The check either specifies no required content, or the required content is present in the response.
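The two success conditions above can be sketched as a small evaluation function. This is a minimal illustration that does not call the Monitoring API or make network requests: `ProbeResult` and `probe_succeeded` are hypothetical names, the location labels are made up, and treating any 2xx status code as "Success" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record of one uptime probe from one location."""
    location: str
    status_code: int
    body: str
    latency_ms: float


def probe_succeeded(result: ProbeResult, required_content: Optional[str] = None) -> bool:
    """Apply both conditions: successful status, and required content present if any."""
    status_ok = 200 <= result.status_code < 300  # assumption: "Success" means 2xx
    content_ok = required_content is None or required_content in result.body
    return status_ok and content_ok


# Made-up results from three locations.
results = [
    ProbeResult("location-a", 200, "status: ok", 120.0),
    ProbeResult("location-b", 200, "status: ok", 95.5),
    ProbeResult("location-c", 503, "unavailable", 300.2),
]

outcome = {r.location: probe_succeeded(r, required_content="ok") for r in results}
print(outcome)
# {'location-a': True, 'location-b': True, 'location-c': False}
```

How per-location outcomes are combined into a single up or down verdict is left out here, since the text above does not specify it.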