
Cloud Services

Introduction

References

Vocabulary

  • ACID: Atomic, Consistent, Isolated, Durable (RDBMS)
  • ADC: Application Delivery Controller. https://www.oreilly.com/learning/developer-defined-application-delivery?imm_mid=0ee8c5&cmp=em-webops-na-na-newsltr_20170310
  • BASE: Basically Available, Soft state, Eventually consistent (a contrived acronym (Fow))
  • CAP Theorem: Consistency, Availability, Partition tolerance - pick any two (Fow)
    • When a partition happens: do you want consistency or availability? (It can be a spectrum/sliding choice.)
    • In practice it is more a trade-off between consistency and response time.
  • SOA: Service Oriented Architecture
  • Microservice: a small, independently deployable service with a single responsibility.
  • RDBMS: Tables
  • OLAP: Cubes
  • NoSQL: Collections (the term originated as a hashtag for a single meetup).
  • Impedance mismatch problem: cohesive data (e.g. one screen's worth) splattered across multiple tables.

Tools

View list

Micro services

Docker

Kafka

Notes

  • I recommend that you create and maintain the services that determine your competitive advantage yourself, and use third-party services for other tasks (Kol19?).

12 factor app

The Twelve-Factor App approach is a methodology for building Software as a Service (SaaS) applications to fulfill the following three objectives (Kol19, p59):

  • Configurations in declarative formats

  • Maximum portability with operating systems and clouds

  • Continuous deployment and scaling

  • https://12factor.net/
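
As a minimal sketch of the "configurations in declarative formats" and portability points: configuration is read from the environment rather than hard-coded, so the same build runs unchanged on any OS or cloud. The variable names below are invented for illustration.

```python
import os

# Hypothetical settings, read from the environment (12-factor config).
DATABASE_URL = os.environ["DATABASE_URL"]                        # required: fail fast if missing
CACHE_URL = os.environ.get("CACHE_URL", "redis://localhost:6379/0")
MAX_WORKERS = int(os.environ.get("MAX_WORKERS", "4"))

print(f"connecting to {DATABASE_URL} with {MAX_WORKERS} workers")
```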

Data Storage

NoSQL

  • Scalability

  • Performance

  • High availability

  • Less functionality than an RDBMS, more performance.

  • Encapsulate the DB

Characteristics(Fow)

  • non-relational
  • open-source
  • cluster-friendly: able to run on big clusters.
  • 21st Century Web
  • Schema-less

Data model/Type of NoSQL storage

  • Key-Value Store
    • Memcached
    • Coherence
    • Redis
    • Project Voldemort
    • Riak
  • Tabular:
    • BigTable
    • HBase
    • Accumulo
    • Dynamo?
  • Document Oriented (complex data) - usually JSON, but could also be XML (see the sketch after this list).
    • MongoDB
    • CouchDB
    • Cloudant
    • RavenDB
  • Column-family
    • Cassandra
    • Apache HBase
  • Graph (tend to be ACID)
    • Neo4J
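
To contrast the first two models above, a hedged sketch using the redis-py and pymongo clients (the database, key, and collection names are made up):

```python
import json
import redis                     # key-value store client
from pymongo import MongoClient  # document store client

# Key-value: the value is opaque to the store; you can only get/set by key.
kv = redis.Redis(host="localhost", port=6379)
kv.set("user:42", json.dumps({"name": "Ada", "plan": "pro"}))
user = json.loads(kv.get("user:42"))

# Document: the store understands the structure, so you can query by field.
products = MongoClient("mongodb://localhost:27017")["shop"]["products"]
products.insert_one({"sku": "X1", "name": "Widget", "tags": ["new", "sale"]})
on_sale = list(products.find({"tags": "sale"}))
```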

What is Missing:

  • No 'join' support
  • No complex transaction support
  • No constraint support

What is available:

  • Query language (other than SQL)
  • Fast performance
  • Horizontal Scalability

When to use:

  • The ability to store and retrieve great quantities of data is important.
  • Storing relationships between the elements is not important.
  • Dealing with growing lists of elements: Twitter posts, internet server logs, blogs.
  • The data is unstructured, or the structure changes over time.
  • Prototypes or applications need to be developed quickly.
  • Constraint and validation logic does not need to be implemented in the database.

When not to use:

  • Complex transactions need to be handled.
  • Joins and validations must be handled by the database.

Fowler's quick select:

  • User sessions: Redis (see the sketch after this list)
  • Financial data: RDBMS
  • Shopping cart: Riak
  • Recommendations: Neo4J
  • Product catalog: MongoDB
  • Reporting: RDBMS
  • Analytics: Cassandra
  • User activity logs: Cassandra
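
For the "user sessions: Redis" entry, a minimal sketch of why a key-value store fits: the whole session is one aggregate looked up by a single key, with a TTL instead of cleanup jobs. The key layout and session fields are assumptions.

```python
import json
import uuid
import redis

r = redis.Redis()

def create_session(user_id: str, ttl_seconds: int = 1800) -> str:
    """Store the whole session as one value and let it expire automatically."""
    session_id = str(uuid.uuid4())
    session = {"user_id": user_id, "cart": [], "last_page": "/"}
    r.setex(f"session:{session_id}", ttl_seconds, json.dumps(session))
    return session_id

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```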

Design concepts

Keep your transactions within a single aggregate (Fow).

Multiple agents updating the same data at the same time

Use delta/diff from each update.

Notice that multiple applications can add columns to the same row without conflicting; they only conflict if they try to update a column with the same name. Hence, we can resolve the conflict by "moving" the column name in the Fields table to the row key, creating a compound key: /. (For this example, we're using a slash '/' to separate the two parts of the key.) This will yield one row per field name/field value pair.

It also doesn't matter in which order the column updates occur, since they will all be added to the same row. The only rule we have to follow is that each log record is processed by a single update application.
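
A plain-Python sketch of the "one row per field name/field value pair" idea (the dict stands in for a column-family table; only the '/' compound-key convention comes from the text above, the rest is invented):

```python
# In-memory stand-in for a wide-column table: {row_key: {column: value}}
table: dict = {}

def apply_update(record_id: str, field_name: str, field_value: str) -> None:
    # Each field gets its own row key, so concurrent updaters writing
    # different fields of the same record can never collide on a column.
    row_key = f"{record_id}/{field_name}"     # compound key: record id + field name
    table.setdefault(row_key, {})["value"] = field_value

# Two applications touching different fields of the same record cannot conflict,
# and the order of the updates does not matter.
apply_update("log-17", "status", "shipped")
apply_update("log-17", "carrier", "DHL")
print(table)
```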

When updating a page:

  1. Send the full structure back, with information on what changed in which fields.
  2. Apply the updates.
  3. Refresh the whole structure.
  4. Send the structure back to the page.

Message bus

Cynefin model

To watch

Challenges:

Inexperience

Challenge 1: Deconstructing databases.

  • Microservices view:

    • event bus replaces operational database.
    • DB per MicroService (if persistence needed)
    • Poly-glot (various NoSQL, SQL)
    • Few (10%) writable; even fewer transactional.
  • Rapids: every event

  • Rivers: Themed events

  • Ponds: State/History

Needs a high-performance event bus (like Kafka) handling 250k msgs/sec (both reads and writes count as a msg), e.g. 0MQ.

  • Always publish to the river

  • Always listen to the river.

  • Event publishing

  • Solution collecting: Redis
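
A hedged sketch of "always publish to the river / always listen to the river" with the kafka-python client (the broker address and topic name are assumptions):

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"          # assumed broker address
RIVER = "orders"                   # a 'river': one themed event stream

# Publish every state change as an event instead of writing to a shared DB.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(RIVER, {"event": "order_placed", "order_id": 17, "total": 42.0})
producer.flush()

# Any interested microservice listens to the themed stream it cares about.
consumer = KafkaConsumer(
    RIVER,
    bootstrap_servers=BROKER,
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print("received:", message.value)
    break
```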

Challenge 2: Synchronous or Asynchronous

Chad Fowler vs. Fred George

  • Chad: Use synchronous as default
    • Algorithms are typically described serially
    • Programmer understanding -> productivity
  • Fred: Use asynchronous as default
    • Robustness should be the primary goal
    • Supports better de-coupling (which gives easier testing)
    • Teach the programmers!

30:37 - Post need, select solution.

34:58 - Service Taxonomy

Challenge 3: MicroServices or Clojure

Microservices like OO

  • Conceptualization(job)
    • Every service has one job
    • If two jobs, make two services
  • Communication
    • Minimize messages (whether RESTful or Events)
  • Encapsulate
    • Service has its own persistence
    • If sharing persistence, just one logical service

Clojure Loves Shared data

Challenge 4: Choosing architectures and frameworks (we have candidates, no established ones)

PIGATO: a high-performance microservices framework based on ZeroMQ

Challenge 5: No Design patterns book yet.

Challenge 6: Corequisite technology, processes, and organization

Challenge 7: What is a microservice

A taxonomy may be useful (before it's too late)

  • Synchronicity degree
    • primary API access to services
  • Ratio of number of services and average size
    • Zones for clarity
  • DB / service ratio
    • Expose potential DB hindrances to rapid deployment

Real World Microservices

  • https://www.youtube.com/watch?v=1aaw7iYS_VM&t=303s

  • Values and principles / complexities

  • Autonomy / Communication

    • The people responsible for the external interfaces need to be very good at communication.
  • Speed of change / Execution

    • You need to pay for automation of testing and deployment to keep up with the flux of releases.
  • Scale / Resilience

    • Providing consistency.
  • Composability / Maintenance

  • Tech diversity / Operational overhead

Splitting from the monolith

Brown field

15min - Possible boundary indicators:

  • Domain bounded context (e.g. a change in language)
  • Rate of change
  • Team structure (Conway's law)
  • What hurts most
    • E.g. put a service in front of the pain-point legacy services.

Green field

20min (neo4j)

  • Evolutionary
    • One service at a time (not 20 in one big bang).
  • Back to basic principles
    • Single responsibility
    • Loose coupling
    • High cohesion
    • SOLID principles

Architecture

In practice

Microservices at Netflix Scale

First principles

  • Buy vs build
    • Use or contribute to OSS technologies first.
    • Only build what you have to.
  • Services should be stateless*
    • Must not rely on sticky sessions
    • Prove by Chaos testing
    • (*Except the persistence/Caching layers)
  • Scale out vs. scale up
    • If you keep scaling up, you'll hit a limit.
    • Horizontal scaling gives you a longer runway.
  • Redundancy and isolation for resiliency
    • Make more than one of anything
    • Isolate the blast radius for any given failure.
  • Automate destructive testing
    • Simian Army
    • Started with Chaos Monkey

Stateless services

Time: 11:00

  • Register into service discovery
  • Implement an externally callable health check
    • Be able to verify it is operational.
  • Be able to get information on how to connect to other services.
  • Verify statelessness: Chaos Monkey
    • Randomly kill services
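
A minimal Flask sketch of "register into service discovery" plus an externally callable health check, as described in the list above; the registry endpoint shown is hypothetical (a real setup would use Eureka, Consul, etc.):

```python
import socket
import requests
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Externally callable check: anyone can verify this instance is operational.
    return jsonify(status="UP", host=socket.gethostname())

def register(registry_url: str, service_name: str, port: int) -> None:
    # Hypothetical registry API; only for illustration.
    requests.post(f"{registry_url}/register", json={
        "name": service_name,
        "host": socket.gethostname(),
        "port": port,
        "health": f"http://{socket.gethostname()}:{port}/health",
    })

if __name__ == "__main__":
    register("http://registry.local:8500", "catalog-service", 8080)
    app.run(port=8080)
```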

Time: 13:34 Data

  • Data - from RDBMS to Cassandra
    • Multi-regional replication
  • Billing - 15:40

Benefit from microservices

16:35

  • Our priorities
      1. Innovation
      • Tight coupling doesn't work
      • Loose coupling: each team works independently, with end-to-end ownership
        • Develop, Test, Deploy, Support
        • Architect -> Design -> Develop -> Review -> Test -> Deploy -> Run -> Support -> Architect...
      • Separation of concerns: 19:23
        • UI: Feature A > Feature B > Feature C
        • Personalization: Feature D > A/B test E
        • Mid-tier: A/B test F > Feature H
        • Infrastructure: Availability > Scalability > Security
      2. Reliability
      3. Efficiency

Cost of microservices

20:00

  • Microservices are an organizational change
  • Evolving the organization
  • Central infrastructure investment
  • Migration doesn't happen overnight (Roman riding: two horses, one rider)
    • Living in the hybrid world
    • Supporting two tech stacks
    • Double the maintenance
    • Multi-master data replication

Microservice lessons learned

23:16

  • IPC is crucial for loose coupling
    • Common language between the services
    • Establishes the contract of interaction
  • Caching to protect DBs: 24:56 (the most heavily hit DBs)
    1. Read from the cache.
    2. On a cache miss, call the service.
    3. The service calls the DB and responds.
    4. The service updates the cache.
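
A sketch of that cache-aside sequence with redis-py (the DB call is stubbed out and the key layout is invented):

```python
import json
import redis

cache = redis.Redis()

def fetch_from_db(item_id: str) -> dict:
    # Stand-in for the real database query.
    return {"id": item_id, "title": f"item {item_id}"}

def get_item(item_id: str, ttl_seconds: int = 300) -> dict:
    cached = cache.get(f"item:{item_id}")            # 1. read from the cache
    if cached is not None:
        return json.loads(cached)
    item = fetch_from_db(item_id)                    # 2.-3. on a miss, hit the DB
    cache.setex(f"item:{item_id}", ttl_seconds, json.dumps(item))  # 4. update cache
    return item
```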

Operational visibility matters (Telemetry) - 26:21

  • If you can't see it, you can't improve it.

  • Will your telemetry scale?

    • Observe -> Orient -> Decide -> Act -> Observe...
    • 20 million metrics per second at Netflix
      • Some of this data will be fed into automated error correction tools
  • You don't have the luxury of architectural diagrams, because things change all the time; you must be able to discern at run time:

    • who calls who
    • how
    • where are the errors
    • where is traffic flowing
    • is there congestion into the system

Reliability matters

  • Cascading failures affect the whole system's uptime
  • Circuit breaker (see the sketch after this list)
    • Detect the problem
    • Is it fatal?
      • If not fatal, go to fallbacks:
        • Hystrix
        • FIT: Failure Injection Test framework
          • latency
          • others
    • Monthly, randomly select one region and fail it. As a user, you will not see a thing (Chaos Kong)
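
A toy circuit breaker in the spirit of Hystrix, not the Hystrix API itself (the thresholds, timings, and fallback are assumptions):

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, func, fallback):
        # While the breaker is open, skip the failing dependency entirely.
        if self.failures >= self.max_failures:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()
            self.failures = 0                    # half-open: allow one retry
        try:
            result = func()
            self.failures = 0                    # success closes the breaker
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.time()
            return fallback()                    # non-fatal: degrade gracefully

breaker = CircuitBreaker()
# breaker.call(call_recommendation_service, lambda: {"recommendations": []})
```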

A word on containers

  • Containers change the level of encapsulation from VM to process
  • Containers can help deliver great developer experience
  • To run containers in production at scale... 37:35
    • Titus UI/API
    • Fenzo
    • Cassandra
    • Zookeeper
    • Docker
    • Mesos

Microservices resources

  • netflix.github.com

Summary

  • Microservices at scale require organizational change and centralized infrastructure investment.

  • Be aware of your situation and what works for you. 41:33

  • Zuul: Front end proxy

  • Deployment: Asgard, Spinnaker

  • Visualization: Flux and Flow

Cloud service components

Application Delivery Controllers (ADC)

Monitoring

Probably ties into ADC.

failures and errors will be detected, mitigated, and resolved before they bring down any part of the microservice ecosystem.

Monitoring a production-ready microservice has four components (https://www.oreilly.com/learning/monitoring-a-production-ready-microservice?imm_mid=0ee8c5&cmp=em-webops-na-na-newsltr_20170310)

  1. The first is proper logging of all relevant and important information,
    • which allows developers to understand the state of the microservice at any time in the present or in the past.
  2. The second is the use of well-designed dashboards that accurately reflect the health of the microservice,
    • and are organized in such a way that anyone at the company could view the dashboard and understand the health and status of the microservice without difficulty.
  3. The third component is actionable and effective alerting on all key metrics,
    • a practice that makes it easy for developers to mitigate and resolve problems with the microservice before they cause outages.
  4. The final component is the implementation and practice of running a sustainable on-call rotation responsible for the monitoring of the microservice.
  • the behavior of a microservice is the sum of its behavior across all of its instantiations

  • identifying which properties of a microservice are necessary and sufficient for describing its behavior, and then determining what changes in those properties tell us about the overall status and health of the microservice.

  • Host and infrastructure metrics are those that pertain to the status of the infrastructure and the servers on which the microservice is running,

  • while microservice metrics are metrics that are unique to the individual microservice.

  • the CPU utilized by the microservice on each host,

  • the RAM utilized by the microservice on each host,

  • the available threads,

  • the microservice’s open file descriptors (FD),

  • and the number of database connections that the microservice has to any databases it uses.

developers should be able to know how much CPU their microservice is using on one particular host and how much CPU their microservice is using across all hosts it runs on.
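
A hedged sketch of emitting the per-host CPU metric so that both the per-host and the all-hosts view can be built from the same data; the emit() function and metric name are placeholders, and psutil is used for measurement:

```python
import socket
import psutil

def emit(metric: str, value: float, tags: dict) -> None:
    # Placeholder: a real system would ship this to Atlas, Prometheus, statsd, etc.
    print(metric, value, tags)

def report_cpu(service: str) -> None:
    # CPU used by this process on this particular host.
    cpu = psutil.Process().cpu_percent(interval=1)
    emit("microservice.cpu_percent", cpu,
         {"service": service, "host": socket.gethostname()})
    # Aggregating this metric over the "host" tag gives the all-hosts view.

report_cpu("catalog-service")
```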

Microservice key metrics:

  • Language-specific metrics
  • Availability
  • SLA
  • Latency
  • Endpoint success
  • Endpoint responses
  • Endpoint response times
  • Clients
  • Errors and exceptions
  • Dependencies

we also must monitor the availability of the service, the service-level agreement (SLA) of the service, latency (of both the service as a whole and its API endpoints), success of API endpoints, responses and average response times of API endpoints, the services (clients) from which API requests originate (along with which endpoints they send requests to), errors and exceptions (both handled and unhandled), and the health and status of dependencies.

Importantly, all key metrics should be monitored everywhere that the application is deployed. This means that every stage of the deployment pipeline should be monitored. Staging must be closely monitored in order to catch any problems before a new candidate for production (a new build) is deployed to servers running production traffic.

  • Logging needs to be such that developers can determine from the logs exactly what went wrong and where things fell apart.
    • tracking and logging requests and responses throughout the entire client and dependency chains from end-to-end can illuminate important information about the system that would otherwise go unknown (such as total latency and availability of the stack).
    • logging is expensive: they are expensive to store, they are expensive to access, and both storing and accessing logs comes with the additional cost associated with making expensive calls over the network.
    • Avoid adding debugging logs in code that will be deployed to production—such logs are very costly.
      • If any logs are added specifically for the purpose of debugging, developers should take great care to ensure that any branch or build containing these additional logs does not ever touch production.
  • Logging needs to be scalable, it needs to be available, and it needs to be easily accessible and searchable.
    • it’s often necessary to impose
      • per-service logging quotas
      • limits and standards on what information can be logged
      • how many logs each microservice can store
      • how long the logs will be stored before being deleted.
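
A small sketch of structured, request-scoped logging so a request can be traced end to end across the client and dependency chain described above (the field names and events are invented):

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders-service")

def log_event(request_id: str, event: str, **fields) -> None:
    # One JSON line per event; the request_id ties log lines together
    # across every service in the call chain.
    log.info(json.dumps({"request_id": request_id, "event": event, **fields}))

request_id = str(uuid.uuid4())      # generated at the edge, then propagated
log_event(request_id, "request_received", endpoint="/orders", method="POST")
log_event(request_id, "db_call", table="orders", duration_ms=12)
log_event(request_id, "response_sent", status=201, duration_ms=48)
```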

Dashboard

  • Every microservice must have at least one dashboard where all key metrics (such as hardware utilization, database connections, availability, latency, responses, and the status of API endpoints) are collected and displayed.

  • A dashboard is a graphical display that is updated in real time to reflect all the most important information about a microservice.

  • Dashboards should be easy to interpret so that an outsider can quickly determine the health of the microservice: anyone should be able to look at the dashboard and know immediately whether or not the microservice is working correctly.

  • A dashboard should also serve as an accurate reflection of the overall quality of monitoring of the entire microservice.

    • Any key metric that is alerted on should be included in the dashboard.
  • the exclusion of any key metric in the dashboard will reflect poor monitoring of the service,

  • while the inclusion of metrics that are not necessary will reflect a neglect of alerting (and, consequently, monitoring) best practices.

  • In addition to key metrics, information about each phase of the deployment pipeline should be displayed, though not necessarily within the same dashboard.

    • Developers working on microservices that require monitoring a large number of key metrics may opt to set up separate dashboards for each deployment phase
      • one for staging
      • one for canary
      • and one for production
    • to accurately reflect the health of the microservice at each deployment phase
  • developers should never need to watch a microservice’s dashboard in order to detect incidents and outages.

  • To assist in determining problems introduced by new deployments, it helps to include information about when a deployment occurred in the dashboard.

    • The most effective and useful way to accomplish this is to make sure that deployment times are shown within the graphs of each key metric.
  • Well-designed dashboards also give developers an easy, visual way to detect anomalies and determine alerting thresholds.

  • Very slight or gradual changes or disturbances in key metrics run the risk of not being caught by alerting, but a careful look at an accurate dashboard can illuminate anomalies that would otherwise go undetected.

Alerting

  • The detection of failures, as well as the detection of changes within key metrics that could lead to a failure, is accomplished through alerting.

  • Effective and actionable alerting is essential to preserving the availability of a microservice and preventing downtime.

  • Alerts must be set up for all key metrics.

    • Any change in a key metric at the host level, infrastructure level, or microservice level that could
      • lead to an outage,
      • cause a spike in latency,
      • or somehow harm the availability of the microservice
    • should trigger an alert. Importantly,
    • alerts should also be triggered whenever a key metric is not seen.
  • Three types of thresholds should be set for each key metric, each with both upper and lower bounds (see the sketch after this list):

    • normal
      • reflect the usual, appropriate upper and lower bounds of each key metric and shouldn’t ever trigger an alert.
    • warning
      • Warning thresholds on each key metric will trigger alerts when there is a deviation from the norm that could lead to a problem with the microservice;
        • warning thresholds should be set such that they will trigger alerts before any deviations from the norm cause an outage or otherwise negatively affect the microservice.
    • critical.
      • should be set based on which upper and lower bounds on key metrics actually cause an outage, cause latency to spike, or otherwise hurt a microservice’s availability.
  • In an ideal world, warning thresholds should trigger alerts that lead to quick detection, mitigation, and resolution before any critical thresholds are reached.

    • In each category, thresholds should be
      • high enough to avoid noise,
      • but low enough to catch any and all real problems with key metrics.
  • To determine the appropriate thresholds for a new microservice (or even an old one), developers can run load testing on the microservice to gauge where the thresholds should lie.

    • Running "normal" traffic loads through the microservice can determine the normal thresholds,
    • while running larger-than-expected traffic loads can help determine warning and critical thresholds.
  • The first step is to create step-by-step instructions for each known alert that detail how to triage, mitigate, and resolve each alert.

  • Runbooks are crucial to the monitoring of a microservice: they allow any on-call developer to have step-by-step instructions on how to mitigate and resolve the root causes of each alert.

  • any alert that, once triggered, requires a simple set of steps to be taken in order to be mitigated and resolved, can be easily automated away.

  • Once this level of production-ready monitoring has been established, a microservice should never experience the same exact problem twice.
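
A sketch of the normal/warning/critical idea with upper and lower bounds per key metric; the numbers are invented, and real values would come from load testing as described above:

```python
# (lower, upper) acceptable bounds per severity; normal sits inside warning,
# which sits inside critical.
THRESHOLDS = {
    "latency_ms":   {"normal": (0, 200),  "warning": (0, 500),  "critical": (0, 1000)},
    "requests_sec": {"normal": (50, 800), "warning": (20, 950), "critical": (5, 1100)},
}

def classify(metric: str, value) -> str:
    if value is None:
        return "critical"                      # alert when a key metric is not seen
    bounds = THRESHOLDS[metric]
    for level in ("critical", "warning", "normal"):
        low, high = bounds[level]
        if not (low <= value <= high):
            # Leaving the normal band is a warning; leaving a wider band is worse.
            return "warning" if level == "normal" else level
    return "ok"

print(classify("latency_ms", 620))    # outside the warning band -> "warning"
print(classify("requests_sec", 2))    # below the critical lower bound -> "critical"
```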

To prevent burnout, on-call rotations should be both brief and shared: no fewer than two developers should ever be on call at one time, and on-call shifts should last no longer than one week and be spaced no more frequently than one month apart.

From Hands-on Microservices with Rust

  • Be careful of the depth of the dependency tree

    • Imagine that you have a microservice that has to wait for the response of another microservice before it can respond to a client. The other microservice, in turn, also has to wait for another microservice, and so on (Kol18, p115).
  • Loose coupling means that a microservice doesn't know anything about other microservices, or how many there are (Kol18, p115).

  • Message-driven: messages are the unit of interaction (Kol18, p116).

    • To have totally uncoupled microservices, you should use a message queue or a message broker service (Kol18, p116) (see the sketch after this list).
    • I guess the downside is a delay in responses.
  • Claim: if your microservices have to process hundreds of thousands of messages, you should use asynchronous code (Kol18, p117).

    • I assume that this claim requires you to be able to throw a lot of resources at the microservice.
    • I also assume that, for the same CPU power, the async application might respond more slowly than a single-threaded app.
  • Connecting microservices:

    • Message broker
    • Remote procedure calls (RPC) (Kol18, p118)
      • JSON-RPC
      • gRPC/protobuf
      • Thrift
      • XML-RPC
  • Reactive manifesto
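
A hedged sketch of message-driven interaction over a broker using pika (RabbitMQ); the queue name and payload are invented, and this is Python rather than the book's Rust:

```python
import json
import pika  # RabbitMQ client

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="billing.requests")   # queue name is made up

# Sender side: it only knows the message format, not who consumes it.
request = {"jsonrpc": "2.0", "method": "charge", "params": {"order_id": 17}, "id": 1}
channel.basic_publish(exchange="", routing_key="billing.requests",
                      body=json.dumps(request))

# Consumer side (would normally live in a different microservice/process).
def handle(ch, method, properties, body):
    message = json.loads(body)
    print("handling", message["method"], message["params"])

channel.basic_consume(queue="billing.requests", on_message_callback=handle,
                      auto_ack=True)
# channel.start_consuming()   # blocks; left commented so the sketch terminates
```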
