Well-Architected Framework
https://cloud.google.com/blog/products/cloud-migration/new-google-cloud-architecture-framework-guide
GCP Architecture Framework
The GCP Architecture Framework provides:
- Architecture best practices
- Guidance on products and services to aid in the application design process

It also provides a foundation for building and improving your Google Cloud deployments using four principles:
Operational excellence - Design choices for improving your operational excellence, including approaches for automating the build process, implementing monitoring, and planning for disaster recovery.
Security, privacy, and compliance - Guidance on the security controls you can use, and a list of products and services best suited to support the security needs of your deployment.
Reliability - How to build reliable and highly available solutions. Recommendations include defining reliability goals, improving observability (including monitoring), establishing incident management functions, and techniques to measure and reduce the operational burden on your teams.
Performance and cost optimization - Suggestions for tuning your application for a better user experience and for analyzing cost on Google Cloud while maintaining an acceptable level of service.
Each section provides strategies, best practices, design questions, and recommendations that you can use while evaluating design choices across various products to build security and reliability into your design.
How to use the framework
We recommend reviewing the “System Design Considerations” section first, and then diving into other sections based on your needs.
Discover: Use the framework as a discovery guide for GCP offerings and learn how various pieces fit together to build solutions.
Evaluate: Use design questions outlined in each section to guide your thought process while thinking about your system design. If you’re unable to answer the design question, you can review the highlighted Google Cloud services and features to address them.
Review: If you’re already on Google Cloud, use the recommendations section to verify if you are following best practices or as a pulse check to review before deploying to production.
The framework is modular, so you can pick and choose the sections most relevant to you, but we recommend reading all of them, because why not!
https://cloud.google.com/architecture/framework/performance-cost-optimization
Performance and cost optimization
Evaluate performance requirements - identify the minimum performance your applications require.
Use scalable design - improve scalability and performance with autoscaling, compute choices, and storage configurations.
Identify and implement cost-saving approaches - evaluate the cost of each running service while associating a priority for service availability and cost.
Best Practices
Use autoscaling and data processing.
Use GPUs and TPUs to increase performance.
Identify apps to tune.
Analyze your costs and optimize.
https://cloud.google.com/architecture/framework/performance-cost-optimization#autoscaling
Compute Engine autoscaling
https://cloud.google.com/architecture/framework/performance-cost-optimization#gke_autoscaling
Google Kubernetes Engine autoscaling - Use the cluster autoscaler.
https://cloud.google.com/architecture/framework/performance-cost-optimization#serverless_autoscaling
Serverless autoscaling -Serverless compute options include Cloud Run, App Engine, and Cloud Functions, each of which provides autoscaling capabilities. Use these serverless options to scale your microservices or functions.
https://cloud.google.com/architecture/framework/performance-cost-optimization#data_processing
Data processing - Dataproc and Dataflow offer autoscaling options to scale your data pipelines and data processing. Use these options to allow your pipelines to access more computing resources based on the processing load.
https://cloud.google.com/architecture/framework/performance-cost-optimization#use_gpus_and_tpus_to_increase_performance
Use GPUs and TPUs to increase performance
https://cloud.google.com/architecture/framework/performance-cost-optimization#identify_apps_to_tune
Identify apps to tune -Application Performance Management (APM) includes tools to help you reduce latency and cost, so that you can run more efficient applications. With Cloud Trace, Cloud Debugger, and Cloud Profiler, you gain insight into how your code and services function, and you can troubleshoot if needed.
Cloud Trace - Cloud Trace is a distributed tracing system for Google Cloud that collects latency data from applications and displays it in near real-time in the Google Cloud Console.
https://cloud.google.com/trace/docs
https://cloud.google.com/architecture/framework/performance-cost-optimization#instrumentation
Instrumentation -Latency plays a big role in determining your users' experience. When your application backend starts getting complex or you start adopting microservice architecture, it's challenging to identify latencies between inter-service communication or identify bottlenecks. Cloud Trace and OpenTelemetry tools help you scale collecting latency data from deployments and quickly analyze it.
Debugging - Cloud Debugger helps you inspect and analyze your production code behavior in real time without affecting its performance or slowing it down.
https://cloud.google.com/architecture/framework/performance-cost-optimization#profiling
Profiling - Poorly performing code increases the latency and cost of applications and web services. Cloud Profiler helps you identify and address performance issues by continuously analyzing the performance of CPU- or memory-intensive functions executed across an application.
https://cloud.google.com/architecture/framework/performance-cost-optimization#analyze_your_costs_and_optimize
Analyze your costs and optimize - Google Cloud's Export Billing to BigQuery feature gives you a detailed way to analyze your billing data. You can connect BigQuery to Google Data Studio or Looker, or to third-party business intelligence (BI) tools like Tableau or Qlik. Use the programmatic notifications feature to send notifications when your budget exceeds a certain threshold. You can use budget notifications with third-party solution providers as well as customized applications.
Sustained use discounts are automatic discounts for running specific Compute Engine resources for a significant portion of the billing month; they are granted for prolonged usage of certain Compute Engine virtual machine (VM) types.
Committed use discounts are ideal for workloads with predictable resource needs. When you purchase a committed use contract, you purchase a certain amount of vCPUs, memory, GPUs, and local SSDs at a discounted price, in return for committing to pay for those resources for one or three years.
A Preemptible VM is an instance that you can create and run at a much lower price than normal instances. However, Compute Engine might terminate (that is, preempt) these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity, so their availability varies with usage.
Google Cloud architecture framework overview
https://cloud.google.com/architecture/
Overview
Principles of system design
https://cloud.google.com/architecture/framework/design-considerations
This Google Cloud architecture framework helps you evaluate the advantages and disadvantages of design choices, and provides guidance on how to optimize, secure, and tune services while controlling the cost of deployment. The framework describes a foundation for building and improving your deployments using four principles:
Operational excellence
Security, privacy, and compliance
Reliability
Performance and cost optimization
Each principle section provides details on strategies, best practices, design questions, recommendations, key Google Cloud services, and links to resources.
Google Cloud system design considerations
- Geographic zones and regions
- Resource management
- Identity and access management
In IAM, you grant access to members. Members can be of the following types.
Google account
Service account
Google group
Google Workspace domain
Cloud Identity domain
Authorization
When an authenticated member attempts to access a resource, IAM checks the resource's IAM policy to determine whether the action is allowed. The entities and concepts involved in the authorization process are described below.
Resources
You can grant access to users for a Google Cloud resource. Some examples of resources are projects, Compute Engine instances, Cloud Storage buckets, and so on.
Permissions
Permissions determine what operations are allowed on a resource. In the IAM world, permissions are represented in the form of service.resource.verb. You don't assign permissions to users directly. Instead, you assign them a role that contains one or more permissions.
Roles
A role is a collection of permissions. When you grant a role to a user, you grant them all the permissions that the role contains. There are three kinds of roles in IAM:
Basic roles. Owner, Editor, and Viewer.
Predefined roles. Predefined roles are IAM roles that give finer-grained access control than basic roles.
Custom roles. Roles that you create to tailor permissions to the needs of your organization when predefined roles don't meet your needs.
IAM policies
You can grant roles to users by creating an IAM policy, which is a collection of statements that define who has what type of access. A policy is attached to a resource and is used to enforce access control whenever that resource is accessed. An IAM policy is represented by the IAM policy object.
Policy hierarchy
You can set an IAM policy at any level in the resource hierarchy: organization, folder, project, or the resource level.
Resources inherit the policies of their parent resource. Set a policy at the organization level to have it automatically inherited by all its children folders and projects. Set a policy at the project level to have it inherited by all the project's child resources. The effective policy for a resource is the union of the policy set at that resource, and the policy inherited from higher up in the hierarchy.
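To make the binding structure concrete, here is a minimal sketch (using the google-cloud-storage Python client; the bucket name and member are hypothetical placeholders) of reading a resource-level IAM policy and appending a role binding:

```python
# Sketch: grant a predefined role on a bucket by editing its IAM policy.
# The bucket name and member below are illustrative placeholders.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-bucket")

# An IAM policy is a collection of bindings; each binding ties a role
# (a set of permissions) to a list of members.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.objectViewer",
    "members": {"user:jane@example.com"},
})
bucket.set_iam_policy(policy)
```

Because policies are inherited down the hierarchy, a similar binding set at the project or organization level would apply to every bucket below it.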
Compute
Most solutions use compute resources in some form, and the selection of compute for your application needs is critical. On Google Cloud, compute is offered as Compute Engine, App Engine, Google Kubernetes Engine (GKE), Cloud Functions, and Cloud Run. You should evaluate your application demands and then choose one of the following compute offerings.
**Choosing the right compute**
Networking
Key services
Virtual Private Cloud (VPC)
Cloud Load Balancing
Cloud CDN (Content Delivery Network)
Cloud DNS
Cloud Interconnect
Storage
https://cloud.google.com/blog/topics/developers-practitioners/map-storage-options-google-cloud
Cloud Storage
Persistent Disk
Filestore
Database
Operational excellence
Operational excellence helps you build a foundation for another critical principle, reliability.
Strategies
Use these strategies to achieve operational excellence.
Automate build, test, and deploy. Use continuous integration and continuous deployment (CI/CD) pipelines to build automated testing into your releases. Perform automated integration testing and deployment.
Monitor business objectives metrics. Define, measure, and alert on relevant business metrics.
Conduct disaster recovery testing. Don't wait for a disaster to strike. Instead, periodically verify that your disaster recovery procedures work, and test the processes regularly.
Best practices
Follow these practices to achieve operational excellence.
Increase software development and release velocity.
Monitor for system health and business health.
Plan and design for failures.
The following sections cover the best practices in detail.
Increase development and release velocity
Monitor system health and business health
Design for disaster recovery
Google Cloud Architecture Framework: Reliability
https://cloud.google.com/architecture/framework/reliability
To run a reliable service, your architecture must include the following:
Measurable reliability goals, with deviations that you promptly correct.
Considerations for scalability, high availability, disaster recovery, and automated change management.
Components that self-heal where possible, and code that includes instrumentation for observability.
Operational procedures that run the service with minimal manual work and cognitive load on operators, and that let you rapidly mitigate failures.
In this Architecture Framework category, you learn how to do the following:
Security, privacy, and compliance
https://cloud.google.com/architecture/framework/security-privacy-compliance
This section of the architecture framework discusses how to plan your security controls, how to approach privacy, and how to work with Google Cloud compliance levels.
Strategies
Use these strategies to help achieve security, privacy, and compliance.
Implement least privilege with identity and authorization controls. Use centralized identity management to implement the principle of least privilege and to set appropriate authorization levels and access policies.
Build a layered security approach. Implement security at each level in your application and infrastructure, applying a defense-in-depth approach. Use the features in each product to limit access. Use encryption.
Automate deployment of sensitive tasks. Take humans out of the workstream by automating deployment and other admin tasks.
Implement security monitoring. Use automated tools to monitor your application and infrastructure. Use automated scanning in your continuous integration and continuous deployment (CI/CD) pipelines, to scan your infrastructure for vulnerabilities and to detect security incidents.
Three control areas focus on mitigating risk:
Technical controls refer to the features and technologies that you use to protect your environment. These include native cloud security controls, such as firewalls and enabling logging, and can also encompass third-party tools and vendors to reinforce or support your security strategy.
Contractual protections refer to the legal commitments made by the cloud vendor around Google Cloud services.
Third-party verifications or attestations refer to having a third party audit the cloud provider to ensure that the provider meets compliance requirements. For example, Google was audited by a third party for ISO 27017 compliance.
You need to assess all three control areas to mitigate risk when you adopt new public cloud services.
Manage authentication and authorization
Grant appropriate roles
A role is a collection of permissions applied to a user. Permissions determine what actions are allowed on a resource and usually correspond with REST methods. For example, a user or a group of users can be granted the Compute Admin role, which allows them to view and edit Compute Engine instances.
Understand when to use service accounts
A service account is a special Google account that belongs to your application or a virtual machine (VM), instead of to an individual end user. Your application uses the service account to call the Google API of a service, so that the users aren't directly involved.
Use Organization Policy Service
IAM focuses on who, and lets the administrator authorize who can take action on specific resources based on permissions. Organization Policy Service focuses on what, and lets the administrator set restrictions on specific resources to determine how they can be configured.
Use Cloud Asset Inventory
This service provides an organization-wide snapshot of your inventory for a wide variety of Google Cloud resources and policies with a single API call. Automation tools can then use the snapshot for monitoring or policy enforcement, or archive it for compliance auditing. If you want to analyze changes to assets, Cloud Asset Inventory also supports exporting metadata history.
Use Policy Intelligence
The recommender, troubleshooter, and validator tools provide helpful recommendations for IAM role assignment, monitor and prevent overly permissive IAM policies, and assist with troubleshooting access-control-related issues.
Auditing
Use Cloud Audit Logs to regularly audit changes to your IAM policy.
Export audit logs to Cloud Storage to store your logs for long periods of time.
Audit who has the ability to change your IAM policies on your projects.
Restrict access to logs using logging roles.
Apply the same access policies to the Google Cloud resource that you use to export logs as applied to the logs viewer.
Use Cloud Audit Logs to regularly audit access to service account keys.
Key services
IAM authorizes who can access and take action on specific Google Cloud resources, and gives you full control and visibility to centrally manage Google Cloud resources.
BeyondCorp Enterprise provides a zero-trust solution that enables an organization's workforce to access web applications securely from anywhere and without the need for VPN, while reducing the threats of malware, phishing, and data loss.
Cloud Asset Inventory provides inventory services based on a time series database. This database keeps a five-week history of Google Cloud asset metadata. The service lets you export all asset metadata at a certain timestamp, or export event change history during a timeframe.
Cloud Audit Logs answers the questions of "who did what, where, and when?" within your Google Cloud resources.
Implement compute security controls
Private IPs
You can disable external IP access to your production VMs using organization policies. You can deploy private clusters with private IPs within GKE to limit possible network attacks. You can also define network policies to manage pod-to-pod communication in the cluster.
Compute instance usage
It's also important to control who can spin up instances, and to manage that access using IAM, because you can incur significant cost if there is a break-in. Google Cloud lets you define custom quotas on projects to limit such activity. VPC Service Controls can help remediate this; for details, see the section on network security.
Compute OS images
Google provides you with curated OS images that are maintained and patched regularly. Although you can bring your own custom images and run them on Compute Engine, you still have to patch, update, and maintain them. Google Cloud regularly updates new vulnerabilities found through security bulletins and provides remediation to fix vulnerabilities for existing deployments.
GKE and Docker
App Engine flexible runs application instances within Docker containers, letting you run any runtime. You can also enable SSH access to the underlying instances, but we do not recommend this unless you have a valid business use case.
Cloud Audit Logs is enabled for GKE, letting you automatically capture all activities with your cluster and monitor for any suspicious activity.
To provide infrastructure security for your cluster, GKE provides the ability to use IAM with role-based access control (RBAC) to manage access to your cluster and namespaces.
We recommend that you enable node auto-upgrade to have Google update your cluster nodes with the latest patch. Google manages GKE masters, and they are automatically updated and patched regularly. In addition, use Google-curated container-optimized images for your deployment. These are also regularly patched and updated by Google.
GKE Sandbox is a good candidate for deploying multi-tenant applications that need an extra layer of security and isolation from their host kernel.
Design questions
How do you manage security of your computing nodes?
Do you need host-based protection?
Do you maintain curated hardened images?
How do you control who can create or delete compute nodes in your production environment?
How frequently do you audit compute creation and deletion?
Do you perform security testing on your deployments in a sandboxed environment? How frequently do you perform and monitor these and update compatibility with the current version of deployment nodes?
Recommendations
Isolate VM communication using service accounts when possible.
Disable external IP addresses at organization level, unless explicitly required.
Use Google-curated images.
Track security bulletins for new vulnerabilities and remediate your instances.
Use private master deployment when using GKE.
Use Workload Identity to control access to Cloud APIs from your GKE clusters.
Enable GKE node auto upgrade.
Secure your network
Network intrusion detection
Many customers use advanced security and traffic inspection tools on-premises, and need the same tools to be available in the cloud for certain applications. VPC packet mirroring lets you troubleshoot your existing Virtual Private Clouds (VPCs). With Google Cloud packet mirroring, you can use third-party tools to collect and inspect network traffic at scale, provide intrusion detection, application performance monitoring, and better security controls, helping you ensure the security and compliance of workloads running in Compute Engine and Google Kubernetes Engine (GKE).
Traffic management
For production deployments, review configured routes under each VPC; strict and limited scoped rules are recommended. For GKE deployments, use Traffic Director to scale envoy management, or Istio for traffic flow management.
Network connectivity
Within Google Cloud, choose Shared VPC or use VPC peering. Network tags and service accounts don't translate over peered projects, but Shared VPC can help centralize them on the host project. Shared VPC makes it easier to centralize service accounts and network tags, but we recommend that you carefully plan how to manage quotas and limitations. VPC peering might introduce duplicated effort to manage these controls, but it gives you more flexibility with quotas and limitations.
For external access, evaluate your bandwidth needs and choose between Cloud VPN, Cloud Interconnect, or Partner Interconnect. It's possible to centralize jump points through a single VPC or project to minimize network management.
VPC Service Controls provides an additional layer of security defense for Google Cloud services that is independent of IAM. While IAM enables granular identity-based access control, VPC Service Controls enables broader context-based perimeter security, including controlling data egress across the perimeter. Use VPC Service Controls and IAM for defense in depth.
Built-in tools
Security Command Center provides multiple detectors that help you analyze the security of your infrastructure, for example Event Threat Detection, Google Cloud Armor logs, and Security Health Analytics (SHA). Enable the services you need for your workloads, and only monitor and analyze required data.
Network Intelligence Center gives you visibility into how your network topology and architecture are performing. You can get detailed insights into network performance and can optimize your deployment to eliminate any bottlenecks on your service. Network reachability provides you with insights into the firewall rules and policies that are applied to the network path.
Key services
VPC Service Controls helps improve your ability to mitigate the risk of data exfiltration from Google-managed services like Cloud Storage and BigQuery. With VPC Service Controls, you can configure security perimeters around the resources of your Google-managed services and control the movement of data across the perimeter boundary.
Traffic Director is Google Cloud's fully managed traffic control plane for service meshes. Using Traffic Director, you can deploy global load balancing across clusters and VM instances in multiple regions, offload health checking from the service proxies, and configure sophisticated traffic control policies. Traffic Director uses open standard APIs (xDS v2) to communicate with the service proxies in the data plane, ensuring that you are not locked in to a proprietary solution and allowing you to use the service mesh control plane of your choice.
Security Command Center provides visibility into what resources are in Google Cloud and their security state. Security Command Center helps make it easier for you to prevent, detect, and respond to threats. It helps you identify security misconfigurations in virtual machines, networks, applications, and storage buckets from a centralized dashboard and take action on them before they can potentially result in business damage or loss.
Istio is an open service mesh that provides a uniform way to connect, manage, and secure microservices.
Packet Mirroring lets you mirror your network traffic and send it to a third-party security solution, such as an intrusion detection system (IDS), for proactively detecting threats and responding to intrusions.
Implement data security controls
You can implement data security controls in relation to three areas: encryption, storage, and databases.
Encryption
Google Cloud offers a continuum of encryption key management options to meet your needs. Identify the solutions that best fit your requirements for key generation, storage, and rotation, whether you are choosing for your storage, compute, or big data workloads. Use encryption as one piece of a broader data security strategy.
Default Encryption: Google Cloud encrypts customer data stored at rest by default, with no additional action required from you. For details on how envelope encryption works, see Encryption at Rest in Google Cloud.
Custom Encryption: Google Cloud lets you use envelope encryption to encrypt your data while storing the key encryption key in Cloud Key Management Service (Cloud KMS). Google Cloud also provides Cloud Hardware Security Modules (HSMs) if you need them. Using IAM permissions with Cloud KMS/HSM at the user-level on individual keys helps you manage access and the encryption process. To view admin activity and key use logs, use Cloud Audit Logs. To secure your data, monitor logs using Monitoring to ensure proper use of your keys.
Storage
Cloud Storage offers Object Versioning, which we recommend be turned on for objects that need to maintain state. Versioning introduces additional storage cost, and a careful tradeoff should be made for sensitive objects.
Object Lifecycle Management helps to archive older objects and downgrade storage class to save cost. These operations need careful planning because there might be charges related to changing storage class and accessing data.
Retention policies using Bucket Lock allow you to govern how long objects in the bucket must be retained for compliance and legal holds. Note that once you lock the bucket with a certain retention policy, you cannot remove or delete the bucket before the expiration date.
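As a concrete illustration, the following sketch (google-cloud-storage client; the bucket name is hypothetical) enables the three controls described above. The retention-lock call is deliberately commented out because, as noted, locking is irreversible:

```python
# Sketch: Object Versioning, a lifecycle storage-class downgrade, and a
# retention policy on one bucket. Names and thresholds are illustrative.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("example-archive-bucket")

bucket.versioning_enabled = True              # keep object history
bucket.add_lifecycle_set_storage_class_rule(  # downgrade class after 30 days
    storage_class="NEARLINE", age=30)
bucket.retention_period = 90 * 24 * 3600      # retain objects for 90 days
bucket.patch()

# bucket.lock_retention_policy()  # permanent until retention expires
```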
Access control
IAM permissions grant access to buckets as well as bulk access to a bucket's objects. IAM permissions give you broad control over your projects and buckets, but not fine-grained control over individual objects.
Access Control Lists (ACLs) grant read or write access to users for individual buckets or objects. In most cases, we recommend that you use IAM permissions instead of ACLs. Use ACLs only when you need fine-grained control over individual objects.
Signed URLs (query string authentication) give time-limited read or write access to an object through a URL that you generate. Anyone with whom you share the URL can access the object for the duration that you specify, regardless of whether they have a Google account.
Signed URLs also come in handy when you want to delegate access to private objects for a limited period of time.
Signed policy documents specify what can be uploaded to a bucket. Policy documents allow greater control over size, content type, and other upload characteristics than signed URLs, and can be used by website owners to allow visitors to upload files to Cloud Storage.
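A minimal sketch of generating a V4 signed URL with the google-cloud-storage client (the bucket and object names are hypothetical, and signing requires service-account credentials):

```python
# Sketch: time-limited, read-only access to a private object via a signed URL.
import datetime
from google.cloud import storage

client = storage.Client()
blob = client.bucket("example-private-bucket").blob("reports/q1.pdf")

url = blob.generate_signed_url(
    version="v4",
    expiration=datetime.timedelta(minutes=15),  # link expires automatically
    method="GET",                               # read-only access
)
print(url)  # anyone holding this URL can read the object until it expires
```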
Customer-managed encryption keys (CMEK). You can generate and manage your encryption keys using Cloud KMS, which acts as an additional encryption layer on top of the standard Cloud Storage encryption.
Customer-supplied encryption keys (CSEK). You can create and manage your own encryption keys. This is an additional layer of encryption to standard Cloud Storage encryption.
Note that keys in Cloud KMS are replicated and made available by Google, while the security and availability of CSEK keys are your responsibility; we recommend weighing this trade-off carefully. You can always choose to perform client-side encryption and store the encrypted data in Cloud Storage, where it is encrypted again with server-side encryption.
Persistent disks are automatically encrypted, but you can choose to supply or manage your own keys. You can store these keys in Cloud KMS or Cloud HSM, or you can supply them from your on-premises devices. Customer-supplied keys are not stored in instance templates nor in any Google infrastructure; therefore, the keys cannot be recovered if you lose them.
Persistent disk snapshots by default are stored in the multi-region that is closest to the location of your persistent disk, or you can choose your region location. You can easily share snapshots to restore new machines within the project in any new region. To share them with other projects, you need to create a custom image.
Cost control. Use cache-control metadata to analyze frequently accessed objects, and Cloud CDN for caching static public content.
Key services
Cloud Key Management Service lets you keep millions of cryptographic keys, allowing you to determine the level of granularity at which to encrypt your data. Set keys to automatically rotate regularly, using a new primary version to encrypt data and limit the scope of data accessible with any single key version. Keep as many active key versions as you want.
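As an illustration of custom encryption, here is a minimal sketch (google-cloud-kms client; the project, key ring, and key names are hypothetical) of an encrypt/decrypt round trip against a Cloud KMS key. Encryption uses the key's current primary version:

```python
# Sketch: encrypt and decrypt a payload with a Cloud KMS key.
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_name = client.crypto_key_path(
    "example-project", "us-central1", "example-keyring", "example-key")

plaintext = b"sensitive payload"
ciphertext = client.encrypt(
    request={"name": key_name, "plaintext": plaintext}).ciphertext

recovered = client.decrypt(
    request={"name": key_name, "ciphertext": ciphertext}).plaintext
assert recovered == plaintext  # round trip succeeds
```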
Use database access controls
Instance-level access
Database access
Implement data security
Cloud Data Loss Prevention (DLP) provides tools to classify, mask, tokenize, and transform sensitive elements to help you better manage the data that you collect, store, or use for business or analytics. For example, features like format-preserving encryption or tokenization allow you to preserve the utility of your data for joining or analytics while obfuscating the raw sensitive identifiers.
Key services
Cloud DLP (DLP) helps you better understand and manage sensitive data. It provides fast, scalable classification and redaction for sensitive data elements like credit card numbers, names, Social Security numbers, US and selected international identifier numbers, phone numbers, and Google Cloud credentials. Cloud DLP classifies this data using more than 120 predefined detectors to identify patterns, formats, and checksums, and even understands contextual clues. You can optionally redact data using techniques like masking, secure hashing, tokenization, bucketing, and format-preserving encryption.
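A minimal sketch (google-cloud-dlp client; the project ID is a hypothetical placeholder) of inspecting text for two of the predefined infoTypes listed above. A real pipeline would typically add de-identification such as masking or tokenization:

```python
# Sketch: classify sensitive elements in a string with Cloud DLP.
import google.cloud.dlp_v2

client = google.cloud.dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/example-project",  # placeholder project
        "inspect_config": {
            "info_types": [
                {"name": "CREDIT_CARD_NUMBER"},
                {"name": "US_SOCIAL_SECURITY_NUMBER"},
            ],
        },
        "item": {"value": "Card 4111-1111-1111-1111, SSN 123-45-6789"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)
```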
Build apps with supply chain security controls
Without automated tools, it's hard to meet consistent security requirements across increasingly complex application environments as they are deployed, updated, or patched. Building a CI/CD pipeline solves many of these issues. For details, see Operational excellence.
Container security
Container Analysis helps with vulnerability scanning so that you can fix issues before container deployment. Container Analysis stores metadata for scanned images, which can help you identify the latest vulnerabilities and patch or update them.
Binary Authorization helps you sign containers with one or more unique attestations. These attestations, along with policy definitions, help you identify and control containers, and deploy only approved containers at runtime. It's a best practice to set up a strict policy model and at least one signer to approve and sign off on container deployments.
Web Security Scanner scans your deployed application for vulnerabilities at runtime. You can configure Web Security Scanner to interact with your application as a signed-in user that navigates and crawls through various pages, scanning for vulnerabilities. We recommend running scans on a test environment that reflects production, to avoid unintended behavior.
Key services
Container Registry, Container Analysis, Binary Authorization, Security Command Center
Audit your infrastructure
Cloud Logging provides audit logging for your Google Cloud services. Cloud Audit Logs helps security teams maintain audit trails in Google Cloud. Because Cloud Logging is integrated into all Google Cloud services, you can log details for long-term archival and compliance requirements. Logging data-access logs can get costly, so be sure to plan carefully before enabling this feature.
Runtime security
GKE integrates with various partner solutions for runtime security to provide you with robust solutions to monitor and manage your deployment. All these solutions can be built to integrate with Security Command Center, providing you with a single pane of glass.
Reliability principles
Design and operational principles
To maximize system reliability, the following design and operational principles apply. Each of these principles is discussed in detail in the rest of the Architecture Framework reliability category.
Define your reliability goals https://cloud.google.com/architecture/framework/reliability/define-goals
The high-level considerations for this section of the Architecture Framework include the following areas:
Choose appropriate SLIs.
The type of service you run also determines which SLIs to monitor, as shown in the following examples. For more detail, refer to https://cloud.google.com/architecture/framework/reliability/define-goals
The following SLIs are typical in systems that serve data: Availability, Latency, Quality.
The following SLIs are typical in data processing systems: Coverage, Correctness, Freshness, Throughput.
The following SLIs are typical in storage systems: Durability. Throughput and latency are also common SLIs for storage systems.
Choose SLIs and set SLOs based on the user experience
If possible, instrument the mobile or web client.
If that's not possible, instrument the load balancer.
A measure of reliability at the server should be the last option.
If you can't measure the customer experience and define goals around it, you can run a competitive benchmark analysis. If there's no comparable competition, measure the customer experience, even if you can't define goals yet.
Iteratively improve SLOs
SLOs shouldn't be set in stone. Revisit SLOs quarterly, or at least annually, and confirm that they continue to accurately reflect user happiness and correlate well with service outages. Make sure that they cover current business needs and new critical user journeys. Revise and augment your SLOs as needed after these periodic reviews.
Iteratively improve SLOs.
Use strict internal SLOs.
Use error budgets to manage development velocity.
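A minimal sketch of the arithmetic behind error budgets (the numbers are illustrative): a 99.9% availability SLO over a 30-day window leaves a fixed budget of allowed downtime, and the rate at which it is consumed can gate release velocity:

```python
# Sketch: derive an error budget from an SLO and track its burn rate.
SLO = 0.999                    # availability target
WINDOW_MINUTES = 30 * 24 * 60  # 30-day rolling window

error_budget = (1 - SLO) * WINDOW_MINUTES   # 43.2 minutes of allowed downtime
downtime_so_far = 12.0                      # measured downtime, in minutes
burn_rate = downtime_so_far / error_budget  # fraction of the budget consumed

print(f"budget: {error_budget:.1f} min, consumed: {burn_rate:.0%}")
if burn_rate > 0.5:  # illustrative policy threshold
    print("Over half the budget is spent: prioritize reliability work.")
```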
Build observability into your infrastructure and applications
Observability includes monitoring, logging, tracing, profiling, debugging, and similar systems.
Design for scale and high availability
Create redundancy for higher availability
https://cloud.google.com/architecture/framework/reliability/design-scale-high-availability
The following examples provide for redundancy and might be part of your system architecture:
Set up storage systems to replicate data across zones to permit failover.
To enable disaster recovery if there's a regional outage, archive or replicate data to a remote region.
To isolate failures in DNS registration to individual zones, use zonal DNS names for instances on the same network to access each other.
If your service needs to run even when an entire region is down, design the service to use pools of compute resources spread across different regions.
Use data replication across regions and automatic failover when a region goes down. Some Google Cloud services have multi-regional variants, such as BigQuery and Cloud Spanner. To be resilient against regional failures, use these multi-regional services in your design where possible.
Eliminate scalability bottlenecks
Identify system components that can't grow beyond the resource limits of a single VM or a single zone. If possible, redesign these components to scale horizontally, such as with sharding or partitioning across VMs or zones. To handle growth in traffic or usage, you add more shards. Use standard VM types that can be added automatically to handle increases in per-shard load.
If you can't redesign the application, you can replace components with managed services that are designed to scale horizontally with no user action.
Degrade service levels gracefully when overloaded
Design your services to tolerate overload. Services should detect overload and return lower quality responses to the user or partially drop traffic, not fail completely under overload.
For example, a service can respond to user requests with static web pages and temporarily disable dynamic behavior that's more expensive to process. Or, the service can allow read-only operations and temporarily disable data updates.
Operators should be notified to correct the error condition when a service degrades.
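A minimal, framework-agnostic sketch of this pattern (names and thresholds are illustrative): when in-flight requests exceed a limit, the handler serves a cheap static fallback instead of failing outright:

```python
# Sketch: detect overload and degrade to a static response instead of failing.
import threading

MAX_IN_FLIGHT = 100
_in_flight = 0
_lock = threading.Lock()

def handle_request(render_dynamic, render_static):
    """Serve the dynamic page normally; fall back to static under overload."""
    global _in_flight
    with _lock:
        overloaded = _in_flight >= MAX_IN_FLIGHT
        if not overloaded:
            _in_flight += 1
    if overloaded:
        return render_static()  # degraded but still functional
    try:
        return render_dynamic()
    finally:
        with _lock:
            _in_flight -= 1
```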
Prevent and mitigate traffic spikes
Implement spike mitigation strategies on the server side such as throttling, queueing, load shedding or circuit breaking, graceful degradation, and prioritizing critical requests.
Mitigation strategies on the client include client-side throttling and exponential backoff with jitter.
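A minimal sketch of the client-side mitigation named above, exponential backoff with full jitter: each retry sleeps a random time up to an exponentially growing cap, which spreads retries out instead of hammering a recovering server in synchronized waves:

```python
# Sketch: retry a flaky operation with capped exponential backoff plus jitter.
import random
import time

def call_with_backoff(operation, max_attempts=5, base=0.5, cap=30.0):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted; surface the error
            # Full jitter: uniform random sleep in [0, min(cap, base * 2^n)].
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```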
Sanitize and validate inputs
To prevent erroneous, random, or malicious inputs that cause service outages or security breaches, sanitize and validate input parameters for APIs and operational tools.
Fail safe in a way that preserves function
If there's a failure due to a problem, the system components should fail in a way that allows the overall system to continue to function. These problems might be a software bug, bad input or configuration, an unplanned instance outage, or human error. What your service processes helps determine whether it's better to be overly permissive or overly simplistic, rather than overly restrictive.
Design API calls and operational commands to be retryable
APIs and operational tools must make invocations retry-safe as far as possible. A natural approach to many error conditions is to retry the previous action, but you might not know whether the first try was successful.
Your system architecture should make actions idempotent: if you perform the identical action on an object two or more times in succession, it should produce the same result as a single invocation. Non-idempotent actions require more complex code to avoid corruption of the system state.
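A minimal sketch of one common way to make a mutating call retry-safe (the storage here is an in-memory dict purely for illustration): the client attaches a request ID, and the server replays the recorded result for a duplicate instead of applying the change twice:

```python
# Sketch: idempotent create via client-supplied request IDs.
import uuid

_completed: dict = {}  # request_id -> result (a real service would persist this)

def create_order(request_id, item):
    if request_id in _completed:        # duplicate retry: no double-create
        return _completed[request_id]
    order_id = f"order-{uuid.uuid4()}"  # ...the real mutation happens here...
    _completed[request_id] = order_id
    return order_id

# The client generates the ID once and reuses it on every retry.
rid = str(uuid.uuid4())
assert create_order(rid, "widget") == create_order(rid, "widget")
```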
Identify and manage service dependencies
Service designers and owners must maintain a complete list of dependencies on other system components. The service design must also include recovery from dependency failures. Include external dependencies, such as third-party service APIs. Minimize the number of critical dependencies whose failure might cause service outages.
To render failures in your service less harmful to other components that depend on it, consider the following example design techniques and principles:
Use request queues.
Cache responses from other services to convert them into non-critical dependencies.
Fail safe in a way that preserves function.
Degrade gracefully when there's a traffic overload.
Startup dependencies
Services behave differently when they start up compared to their steady-state behavior. Startup dependencies can differ significantly from steady-state runtime dependencies.
For example, at startup, a service may need to load user or account information from a user metadata service that it rarely invokes again. When many service replicas restart after a crash or routine maintenance, the replicas can sharply increase load on startup dependencies, especially when caches are empty and need to be repopulated.
Test service startup under load, and provision startup dependencies accordingly. Consider a design to gracefully degrade by saving a copy of the data it retrieves from critical startup dependencies. This behavior allows your service to restart with potentially stale data rather than being unable to start when a critical dependency has an outage. Your service can later load fresh data, when feasible, to revert to normal operation.
Startup dependencies are also important when you bootstrap a service in a new environment. Design your application stack with a layered architecture, with no cyclic dependencies between layers. Cyclic dependencies may seem tolerable because they don't block incremental changes to a single application. However, cyclic dependencies can make it difficult or impossible to restart after a disaster takes down the entire service stack.
Ensure that every change can be rolled back
To apply the guidance in the Architecture Framework to your own environment, follow these recommendations:
Implement exponential backoff with randomization in the error retry logic of client applications.
Implement a multi-region architecture with automatic failover for high availability.
Use load balancing to distribute user requests across shards and regions.
Design the application to degrade gracefully under overload. Serve partial responses or provide limited functionality rather than failing completely.
Establish a data-driven process for capacity planning, and use load tests and traffic forecasts to determine when to provision resources.
Establish disaster recovery procedures and test them periodically.
Create reliable operational processes and tools
https://cloud.google.com/architecture/framework/reliability/create-operational-processes-tools
Implement progressive rollouts with canary testing
Spread out traffic for timed promotions and launches
Automate build, test, and deployment
Defend against operator error
Test failure recovery
Conduct disaster recovery tests
Practice chaos engineering
Build efficient alerts
Optimize the alert delay
There's a balance between alerts sent too soon, which stress the operations team, and alerts sent too late, which cause long service outages. Tune the alert delay before the monitoring system notifies humans of a problem to minimize time to detect, while maximizing signal versus noise. Use the error budget consumption rate to derive the optimal alert configuration.
Alert on symptoms rather than causes
Trigger alerts based on the direct impact to user experience. Noncompliance with global or per-customer SLOs indicates a direct impact. Don't alert on every possible root cause of a failure, especially when the impact is limited to a single replica. A well-designed distributed system recovers seamlessly from single-replica failures.
Build a collaborative incident management process
Incident management overview
Assign clear service ownership
Reduce mean time to detect (MTTD) with well tuned alerts
Reduce mean time to mitigate (MTTM) with incident management plans and training
Design dashboard layouts and content to minimize MTTM
Use blameless postmortems to learn from outages and prevent recurrences
Incident management plan example
Performance and cost optimization
Strategies
Evaluate performance requirements. Determine the priority of your various applications and what minimum performance you require of them.
Use scalable design patterns. Improve scalability and performance with autoscaling, compute choices, and storage configurations.
Identify and implement cost-saving approaches. Evaluate cost for each running service while associating priority to optimize for service availability and cost.
Best practices
Use autoscaling and data processing.
Use GPUs and TPUs to increase performance.
Identify apps to tune.
Analyze your costs and optimize.
Use autoscaling and data processing
Use autoscaling so that as load increases or decreases, the services add or release resources to match.
Compute Engine autoscaling
Managed instance groups (MIGs) let you scale your stateless apps on multiple identical VMs, so that a group of Compute Engine resources is launched based on an instance template. You can configure an autoscaling policy to scale your group based on CPU utilization, load-balancing capacity, Cloud Monitoring metrics, schedules, and, for zonal MIGs, by a queue-based workload, like Pub/Sub.
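A minimal sketch (google-cloud-compute Python client; the project, zone, and group names are hypothetical) of attaching an autoscaling policy to a zonal MIG that targets 60% average CPU utilization, one of the signals listed above:

```python
# Sketch: create an autoscaler for an existing managed instance group.
from google.cloud import compute_v1

autoscaler = compute_v1.Autoscaler(
    name="web-autoscaler",  # hypothetical name
    target=("https://www.googleapis.com/compute/v1/projects/example-project"
            "/zones/us-central1-a/instanceGroupManagers/web-mig"),
    autoscaling_policy=compute_v1.AutoscalingPolicy(
        min_num_replicas=2,
        max_num_replicas=10,
        cpu_utilization=compute_v1.AutoscalingPolicyCpuUtilization(
            utilization_target=0.6),  # hold average CPU near 60%
    ),
)

compute_v1.AutoscalersClient().insert(
    project="example-project",
    zone="us-central1-a",
    autoscaler_resource=autoscaler)
```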
Google Kubernetes Engine autoscaling
You can use the cluster autoscaler feature in Google Kubernetes Engine (GKE) to manage your cluster's node pool based on varying demand of your workloads. Cluster autoscaler increases or decreases the size of the node pool automatically, based on the resource requests (rather than actual resource utilization) of Pods running on that node pool's nodes.
Serverless autoscaling
Serverless compute options include Cloud Run, App Engine, and Cloud Functions, each of which provides autoscaling capabilities. Use these serverless options to scale your microservices or functions.
Data processing
Dataproc and Dataflow offer autoscaling options to scale your data pipelines and data processing. Use these options to allow your pipelines to access more computing resources based on the processing load.
Design questions
Which of your applications have variable user load or processing requirements? Which of your data processing pipelines have variable data requirements?
Recommendations
Use Google Cloud load balancers to provide a global endpoint.
Use managed instance groups with Compute Engine to scale automatically.
Use the cluster autoscaler in GKE to automatically scale the cluster.
Use App Engine to autoscale your Platform-as-a-Service (PaaS) application.
Use Cloud Run or Cloud Functions to autoscale your function or microservice.
Graphics Processing Unit (GPU)
Compute Engine provides GPUs that you can add to your virtual machine instances. You can use these GPUs to accelerate specific workloads on your instances such as machine learning and data processing.
Tensor Processing Unit (TPU)
A TPU is a matrix processor designed by Google specifically for machine learning workloads. TPUs are best suited for massive matrix operations with a large pipeline, with significantly less memory access.
Identify apps to tune
Application Performance Management (APM) includes tools to help you reduce latency and cost, so that you can run more efficient applications. With Cloud Trace, Cloud Debugger, and Cloud Profiler, you gain insight into how your code and services function, and you can troubleshoot if needed.
Instrumentation
Latency plays a big role in determining your users' experience. When your application backend starts getting complex or you start adopting microservice architecture, it's challenging to identify latencies between inter-service communication or identify bottlenecks. Cloud Trace and OpenTelemetry tools help you scale collecting latency data from deployments and quickly analyze it.
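A minimal sketch (assuming the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages; the span names are illustrative) of exporting OpenTelemetry spans to Cloud Trace, so that inter-service latency shows up as parent/child spans:

```python
# Sketch: wire OpenTelemetry to Cloud Trace and record nested spans.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("checkout"):           # inbound request
    with tracer.start_as_current_span("call-payment"):   # downstream RPC
        pass  # the child span's duration exposes the bottleneck
```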
Debugging
Cloud Debugger helps you inspect and analyze your production code behavior in real time without affecting its performance or slowing it down.
Profiling
Poorly performing code increases the latency and cost of applications and web services. Cloud Profiler helps you identify and address performance issues by continuously analyzing the performance of CPU- or memory-intensive functions executed across an application.
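A minimal sketch (assuming the google-cloud-profiler agent package; the service name is illustrative) of starting the profiler agent at process startup, after which it continuously samples the running service:

```python
# Sketch: start the Cloud Profiler agent once, early in main().
import googlecloudprofiler

googlecloudprofiler.start(
    service="checkout-service",   # label shown in the Profiler UI
    service_version="1.0.0",
    # project_id="example-project",  # only needed outside Google Cloud
)
```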
Analyze your costs and optimize
The first step in optimizing your cost is to understand your current usage and costs. Google Cloud's Export Billing to BigQuery feature gives you a detailed way to analyze your billing data. You can connect BigQuery to Google Data Studio or Looker, or to third-party business intelligence (BI) tools like Tableau or Qlik. Use the programmatic notifications feature to send notifications when your budget exceeds a certain threshold. You can use budget notifications with third-party solution providers as well as customized applications.
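A minimal sketch (google-cloud-bigquery client) of the kind of analysis this enables: total cost per service for one invoice month. The table name is a placeholder; standard billing export tables are named gcp_billing_export_v1_<BILLING_ACCOUNT_ID>:

```python
# Sketch: total cost per service from the billing export table.
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT service.description AS service, SUM(cost) AS total_cost
FROM `example-project.billing.gcp_billing_export_v1_XXXXXX`  -- placeholder
WHERE invoice.month = '202401'
GROUP BY service
ORDER BY total_cost DESC
"""
for row in client.query(query).result():
    print(f"{row.service}: {row.total_cost:.2f}")
```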
Sustained use discounts are automatic discounts for running specific Compute Engine resources for a significant portion of the billing month; they are granted for prolonged usage of certain Compute Engine virtual machine (VM) types.
Committed use discounts are ideal for workloads with predictable resource needs. When you purchase a committed use contract, you purchase a certain amount of vCPUs, memory, GPUs, and local SSDs at a discounted price, in return for committing to pay for those resources for one or three years.
A Preemptible VM is an instance that you can create and run at a much lower price than normal instances. However, Compute Engine might terminate (that is, preempt) these instances if it requires access to those resources for other tasks. Preemptible instances are excess Compute Engine capacity, so their availability varies with usage.
When you understand which components make up your cost, you can decide how to optimize. Finding resources with low utilization or that aren't necessary is an excellent place to start. Compute Engine provides you with sizing recommendations for VMs that you can use to help size your resources. After you implement changes, you can compare your subsequent billing export data to view the differences in cost.
Want to forecast your usage cost? Use the Google Cloud Pricing Calculator.