GCP operations (formerly Stackdriver)

The Operations suite (formerly Stackdriver) provides a set of monitoring and troubleshooting services that help operations teams keep everything under control.

It includes the following services:

  • Cloud Logging – a fully managed service that centralizes the log management of your cloud infrastructure.
  • Cloud Monitoring – a fully managed service that centralizes metrics collection and provides performance and health visibility of your cloud infrastructure.
  • Cloud Trace – a fully managed service that collects application latency data and provides near-real-time application performance information.
  • Cloud Profiler – a fully managed service that measures code performance and spots CPU- and memory-intensive processes inside your application.
  • Cloud Debugger – a fully managed service that inspects the state of a running application without stopping it.
  • Error Reporting – a service that collects and aggregates the errors produced by your running applications.

Costs

Pricing details: https://cloud.google.com/monitoring#pricing

| Free | Incur costs once monthly limits have been exceeded |
| --- | --- |
| Cloud Debugger | Cloud Logging |
| Error Reporting | Cloud Monitoring |
| Cloud Profiler | Cloud Trace |

Google Cloud's operations suite is also capable of monitoring and logging data from your AWS accounts. You must associate your AWS resources with a Google Cloud project, which serves as a connector to AWS.

Cloud Monitoring

  • Groups

Resources such as VM instances, applications, and databases can be organized into logical groups. This allows us to manage them together and display them in dashboards.

  • Dashboards

Dashboards allow us to give visibility to different metrics in a single pane of glass. We can create multiple dashboards that contain charts based on predefined or user-defined metrics.
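
Dashboards can also be managed from the command line. As a minimal sketch, assuming a pre-built layout definition in my-dashboard.json (a hypothetical file), a dashboard can be created from that file:

gcloud monitoring dashboards create --config-from-file=my-dashboard.json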

  • Alerting policies

Alerting policies can be configured to create notifications when event and metric thresholds are reached. A policy can have one or more conditions that trigger the alert, and it will create an incident that is visible in the Cloud Monitoring console.
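
A policy can also be created from a file-based definition. As a minimal sketch, assuming a prepared cpu-policy.json (hypothetical) containing the conditions and notification channels, and noting that the policies command group has historically lived under the alpha track of gcloud:

gcloud alpha monitoring policies create --policy-from-file=cpu-policy.json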

  • Uptime checks

Uptime checks are used for checking the availability of your services from different locations around the globe. They can be combined with alerting policies and are displayed in dashboards. Checks can be done using HTTP, HTTPS, or TCP, and are possible for URLs, App Engine, Elastic Load Balancing (ELB), Kubernetes load balancer services, and Amazon EC2 and GCE instances. The probing interval can be set to 1, 5, 10, or 15 minutes.

Remember that, for the uptime check to work, firewall rules allowing traffic from the uptime-check servers need to be created. To check the IPs of the uptime servers, go to the uptime check console and download the list of IP addresses. https://cloud.google.com/monitoring/uptime-checks/
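
Newer gcloud releases also expose uptime checks on the command line. The flags below are assumptions (host, project, and check name are placeholders) and should be verified against gcloud monitoring uptime create --help:

gcloud monitoring uptime create my-https-check \
    --resource-type=uptime-url \
    --resource-labels=host=example.com,project_id=my-project \
    --protocol=https \
    --period=5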

  • Monitoring agents

To get more out of Cloud Monitoring, a Monitoring agent can be installed on the instance to collect additional metrics. By default, the Monitoring agent collects disk, CPU, network, and process metrics; however, additional metrics can also be collected. The Monitoring agent is a collectd-based agent that can be installed on both GCP and AWS instances. The agent can also be configured to monitor many applications, including the Apache web server, Tomcat, Kafka, Memcached, and Redis. Installation of the agent on Linux is very straightforward and requires the following two commands to be executed:

curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh --also-install

To monitor the instance memory, you need to use the Monitoring agent!

Cloud Logging

Cloud Logging allows you to store and analyze logs, as well as events coming from GCP and AWS. Based on the logs, alerts can be created. It also provides a robust API, allowing logs to be both managed and ingested. This means that any third-party application can leverage Google Cloud's operations suite for logging purposes. The gathered logs are visible in the Legacy Logs Viewer, where they can be filtered and exported for further analysis or archival purposes, or integrated with third-party solutions.

  • Legacy Logs Viewer

The legacy viewer carried over from Stackdriver; it is still available but has been superseded by the Logs Explorer.

  • Logs Explorer

With the Logs Explorer, you can choose to view the logs in two different scopes:

  • Scope by project – allows you to search and view logs from a single project

  • Scope by storage – allows you to view logs from a bucket that was used as a sink

  • Exporting logs

Log entries that are received by Logging can be exported (copied) to Cloud Storage buckets, BigQuery datasets, and Cloud Pub/Sub topics. You export logs by configuring log sinks, which then continue to export log entries as they arrive in Logging. A sink includes a destination and a filter that selects the log entries to be exported. Remember that only the logs created after the sink was configured will be exported.
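
As a minimal sketch, assuming a pre-existing bucket named my-log-archive (hypothetical), a sink that copies error entries to Cloud Storage can be created as follows:

gcloud logging sinks create error-sink storage.googleapis.com/my-log-archive \
    --log-filter='severity>=ERROR'

The command returns a writer service account identity; that identity must be granted write access to the destination, or no entries will be delivered.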

  • Logging agent

The Logging agent is an application based on fluentd; both Linux and Windows machines are supported. It allows the streaming of logs from common third-party applications and system software to Cloud Logging. The agent is included in the images for App Engine and GKE. For Compute Engine and Amazon EC2, it needs to be installed. Installation of the agent on Linux is very simple and requires the following two commands to be executed:

curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh --also-install

By default, the agent streams logs for predefined applications. google-fluentd.conf can be modified to indicate additional logs that should be streamed.
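
As an illustration (the path and tag below are hypothetical), an extra config file can be dropped into /etc/google-fluentd/config.d/ to tail an additional application log, followed by a restart of the agent:

sudo tee /etc/google-fluentd/config.d/myapp.conf <<'EOF'
<source>
  @type tail
  format none
  path /var/log/myapp.log
  pos_file /var/lib/google-fluentd/pos/myapp.pos
  tag myapp
</source>
EOF
sudo service google-fluentd restart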

  • Ops Agent

The Ops Agent is a single agent that collects both logs and metrics, and is Google's recommended replacement for the separate Logging and Monitoring agents.
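
Installation follows the same pattern as the older agents, using the Ops Agent's own repository script:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install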

  • Log-based metrics

Logs can be used to create log-based metrics. Cloud Logging increments the metric every time a log entry matches the metric's filter. This data is then exposed to Monitoring and can be used to create dashboards and alerting policies. As an example, logs containing a particular 404 error message can be counted over a period of 1 minute and exposed as a metric. Log-based metrics can be either system metrics or user-defined (a gcloud sketch follows the list):

  • System metrics: These are predefined by Cloud Logging.

  • User-defined metrics: These metrics are created by a user on a project-by-project basis, based on the filtering criteria.
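
As a minimal sketch of the 404 example above (the metric name and filter are illustrative), a user-defined counter metric can be created with gcloud:

gcloud logging metrics create http-404-count \
    --description="Count of 404 errors in instance logs" \
    --log-filter='resource.type="gce_instance" AND textPayload:"404"'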

  • Cloud Audit Logs

The logs are stored per project, folder, or organization, and are of the following types:

  • Admin Activity (enabled by default)

  • System Event (enabled by default)

  • Data Access

  • Policy Denied

The first two are enabled by default and cannot be deactivated. Data Access is disabled by default, as it can generate a massive amount of information. Audit logs are generated for most of the GCP services.
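
Audit entries can be queried from the command line with gcloud logging read; the filter below targets the Admin Activity log (the project name is a placeholder):

gcloud logging read 'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"' --limit=5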

  • Activity

The audit logs can also be viewed from the ACTIVITY tab on the main GCP console screen, which is outside the Google Cloud operations suite console.

  • Retention

Retention defines how long the logs are stored in Cloud Logging. After the stipulated period, the logs are removed. Depending on the log types, the retention time differs. Refer to the following list of log types and their retention periods:

  • Admin Activity: 400 days
  • Data Access: 30 days
  • System Event: 400 days
  • Policy Denied: 30 days

Note that the logs can be exported and archived for longer periods.

Google Cloud's operations suite for GKE

Cloud Operations for GKE provides observability for GKE clusters at both the cluster and workload level. It shows you the most important GKE cluster resources and lets you drill down to the logs generated by the workload containers. The feature is enabled by default for all new GKE clusters. You can, however, choose what level of monitoring and logging you want: you can either limit it to the system (the Kubernetes cluster itself) or also include the workload (the application). All options are visible in the GKE cluster provisioning wizard after clicking on the dropdown in the Select logging and monitoring type form (a command-line sketch follows the list):

  • System and workload logging and monitoring
  • System logging and monitoring only (beta)
  • System and workload logging only (Monitoring disabled)
  • System monitoring only (Logging disabled)
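
The same selection can be made at cluster creation time. A sketch assuming the --logging and --monitoring flags of gcloud container clusters create (cluster name and zone are placeholders):

gcloud container clusters create my-cluster \
    --zone=us-central1-a \
    --logging=SYSTEM,WORKLOAD \
    --monitoring=SYSTEM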

APM

APM is a set of tools that gives developers insight into how fast and how reliably their applications run. It consists of three services:

  • Trace
  • Debugger
  • Profiler

These tools are integrated into the code of the application. The application does not need to be hosted on GCP; it can run in any cloud or even on-premises, as long as connectivity is available.

Trace

Cloud Trace allows you to track latencies in your microservices application. It shows you the overall time of the application responses but can also show detailed delays for each of the microservices. This allows you to pinpoint the root cause of the latency. The traces are displayed in the GCP console, and analysis reports can be generated. By default, it is installed on Google App Engine (GAE) standard, but it can be used with GCE, GKE, GAE flexible, and non-GCP machines. The tracing mechanism needs to be incorporated using the Cloud Trace SDK or API.

Debugger

This allows you to debug errors in the code of your application, without stopping the application. Developers can request a real-time snapshot of a running application, capturing the call stack and local variables. Debug log points can be injected into the code to display additional information. These can even be done in production, without affecting the end users. By default, it is installed on GAE standard, but it can be used with GCE, GKE, GAE flexible, and non-GCP machines. It does not require a Logging agent.

Profiler

Cloud Profiler shows you how many resources your code consumes. With the changes in the code in your application, there may be an unexpected rise in the demand for resources. Profiler allows you to pinpoint those issues, even in production. It uses a piece of code called the profiler agent that is attached to the main code of your application, and it periodically sends information on resource usage. It currently supports Java, Go, and Node.js, and can be used with GCE, GKE, GAE flexible, and non-GCP machines.

Error Reporting

Error Reporting allows you to collect and aggregate errors that are produced by your applications in a single place. The collected errors can be grouped and displayed in a centralized interface. This way, you can see how many crashes have occurred over a specific time period.

The service works in a very similar way to Cloud Logging, but it allows you to filter only the most important errors and pinpoint the root cause of the crash. Error Reporting works with Cloud Functions, App Engine, GCE, GKE, and AWS EC2. It is enabled by default for the App Engine standard environment. Multiple languages, such as Go, Java, .NET, Node.js, PHP, Python, and Ruby, are supported. There are two ways to leverage Error Reporting:

  • You can use the Cloud Logging API and send properly formatted error messages.
  • You can call the dedicated Error Reporting API.

Information in Error Reporting is retained for 30 days.
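
For a quick test from the shell, an event can be reported through the beta gcloud surface (service name, version, and message are placeholders; verify availability with gcloud beta error-reporting --help):

gcloud beta error-reporting events report \
    --service=my-service \
    --service-version=v1.0 \
    --message="NullPointerException at com.example.Main.run(Main.java:42)"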

VPC Flow Logs

When you **enable VPC Flow Logs on your VPC subnet**, you will be able to collect network traffic samples that are sent or received by Compute Engine instances or Google Kubernetes Engine (GKE) nodes. These logs are stored in Cloud Logging and can be used for network monitoring, network forensics, security analysis, and much more. Flow logs are aggregated by connection, and you can export them for further analysis.

VPC Flow Logs does not affect performance. However, a large number of logs may be generated, which can result in increased costs.
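
Flow logs are enabled per subnet. A sketch with example sampling and aggregation values (subnet and region are placeholders):

gcloud compute networks subnets update my-subnet \
    --region=us-central1 \
    --enable-flow-logs \
    --logging-flow-sampling=0.5 \
    --logging-aggregation-interval=interval-5-sec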

Firewall Rules Logging

Logging firewall rule activity is important for keeping your infrastructure secure and monitored. In GCP, you can enable logging individually inside each firewall rule.
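
A sketch of enabling logging on an existing rule (the rule name is a placeholder):

gcloud compute firewall-rules update allow-ssh --enable-logging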

VPC audit logs