[Draft] Microservices - vinhtbkit/bkit-kb GitHub Wiki

1 - Intro

Key concepts

  • Independent deployability
  • Modeled around a business domain
  • Owning their own states
  • Size
A microservice should be as big as your head
  • Flexibility: keeps options open for solving the problems we might face (organizational, technical, scale, robustness...)
  • Architecture and organization alignment

Advantages

  • Technology heterogeneity: using different techs inside each microservice -> pick the right tools for the job
  • Robustness:
    • Isolate the problematic service
  • Scaling
  • Ease of deployment:
    • Minor changes should not require redeployment of whole system
  • Organizational alignment
  • Compose-ability

Pain points

  • Dev experience
    • Local developer resources
  • Technology overhead
    • Different technologies in each service
    • New terms introduced
  • Cost
    • Resources: CPU/memory, licences...
    • Learning curves
    • Not suitable for cost-reducing organizations
  • Reporting
    • Difficult to gather information
    • Has to adopt new techniques
  • Monitoring and troubleshooting
    • Tracking the problematic service
  • Security
    • Information flows over network -> target of MitM attacks
  • Testing
  • Latency
  • Data consistency

Not for

  • Brand new products or startups
    • Things are going to change quickly -> changes to service boundaries
  • Organizations creating software that will be deployed and managed by their customers (don't have IT backgrounds)

Good for

  • Big teams
  • SaaS applications: 24/7 operations, rolling changes are required
  • Orgs looking to provide services to customers over a variety of new channels

2 - Modeling Microservices

Good service boundaries

  • Information hiding
    • By keeping the number of assumptions small, it is easier to ensure that we can change one module without impacting others
The connections between modules are the assumptions which the modules make about each other.
  • Cohesion
The code that changes together, stays together
  • Coupling: services should be loosely coupled
    • Loosely coupled service: it should know as little as it needs to about the services with which it collaborates
A structure is stable if cohesion is strong and coupling is low
  • There is no absolute best way to organize our code. We have to strike the right balance between coupling and cohesion

Types of coupling

Domain coupling

  • When one service needs functionality of another service
flowchart LR
  Order -->|reserve| Warehouse
  Order -->|take payment| Payment
  • This should be kept at minimum
  • Information hiding: Share only what you absolutely have to, and send only the absolute minimum amount of data you need

Temporal coupling

  • One microservice needs another microservice at the same time for the operation to be completed
  • Difficult to scale -> use some form of asynchronous communication, such as a message broker
flowchart LR
  OrderProcessor -->|Sync blocking network call| Warehouse
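
A minimal sketch of removing the temporal coupling, assuming an in-process `queue.Queue` as a stand-in for a real message broker. The names `place_order` and `warehouse_worker` are hypothetical. The point is that the order is accepted while no Warehouse worker is running.

```python
import queue
import threading

# Stand-in for a message broker; a real system would use an external broker.
reservation_requests = queue.Queue()
reserved = []

def warehouse_worker():
    """Hypothetical Warehouse consumer: drains requests whenever it is running."""
    while True:
        message = reservation_requests.get()
        if message is None:  # sentinel used only to stop this demo
            break
        reserved.append(message["order_id"])

def place_order(order_id):
    """Hypothetical OrderProcessor: enqueue the request and return without waiting."""
    reservation_requests.put({"order_id": order_id})
    return "ACCEPTED"

# The order is accepted even though no Warehouse worker is running yet.
status = place_order("order-1")

# Warehouse comes up later and processes the backlog.
worker = threading.Thread(target=warehouse_worker)
worker.start()
reservation_requests.put(None)
worker.join()
```

The trade-off is that `place_order` can no longer report whether the reservation succeeded; that outcome has to arrive later by some other route.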

Pass-through coupling

Passing data to another service purely because the data is needed by some other microservice further downstream

flowchart LR
  OrderProcessor -->|Shipping manifest| Warehouse
  Warehouse -->|Shipping manifest| Shipping
  • A change to data downstream can cause a significant upstream change
  • Potential solutions
    • Bypass the intermediary -> Talking directly to the downstream service
      • Increase domain coupling (Order needs to know Shipping)
      • Increases logic complexity in Order, which was previously hidden by Warehouse (stock has to be reserved in Warehouse before shipping, and the Order service now has to manage this)
    • Hide downstream details: Order service takes shipping details as its contract, but the Warehouse takes care of building the shipping Manifest
---
title: Bypassing intermediary
---
flowchart LR
  Order -->|1 - reserve stock| Warehouse
  Order -->|2 - shipping Manifest| Shipping
  Order -->|3 - remove stock| Warehouse
---
title: Hiding details
---
flowchart LR
  Order -->|order request| Warehouse
  Warehouse -->|dispatch shipping Manifest| Shipping
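
The "hide downstream details" option could look roughly like this. All field names and the `WEIGHTS_KG` lookup are assumptions for illustration. Order only sends what it already knows, and Warehouse owns the translation into the manifest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OrderRequest:
    """What Order sends: only details it already owns."""
    order_id: str
    sku: str
    address: str

@dataclass(frozen=True)
class ShippingManifest:
    """Built by Warehouse; Order never sees this shape."""
    order_id: str
    address: str
    parcel_weight_kg: float

# Hypothetical warehouse data, unknown to the Order service.
WEIGHTS_KG = {"sku-1": 1.5}

def warehouse_handle(request):
    # Warehouse owns the translation, so a change to the manifest
    # format only touches Warehouse and Shipping.
    return ShippingManifest(request.order_id, request.address, WEIGHTS_KG[request.sku])

manifest = warehouse_handle(OrderRequest("order-1", "sku-1", "1 Main St"))
```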

Common coupling

  • Making use of the same shared resources: database or filesystem
flowchart LR
  Order -->|Set PLACED, PAID, COMPLETED status| OrderTable
  Warehouse -->|Set PICKING, SHIPPED status| OrderTable
  • Common coupling is sometimes OK, but often not
  • Potential solutions:
    • Ensure that a single microservice manages the order state
    • Or create a thin wrapper microservice that does little more than CRUD operations on the database
      • A thin wrapper is a sign of weak cohesion and tighter coupling, since the logic that manages the data is spread elsewhere in the system instead of living in this service
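
A sketch of the first option, where one service owns the order state and other services only request transitions. The linear lifecycle in `ALLOWED` is an assumption built from the statuses in the diagram; the real rules are a business decision.

```python
# Assumed lifecycle for illustration only.
ALLOWED = {
    "PLACED": {"PAID"},
    "PAID": {"PICKING"},
    "PICKING": {"SHIPPED"},
    "SHIPPED": {"COMPLETED"},
    "COMPLETED": set(),
}

class OrderService:
    """The only component allowed to write order status."""

    def __init__(self):
        self._status = {}

    def place(self, order_id):
        self._status[order_id] = "PLACED"

    def request_transition(self, order_id, new_status):
        current = self._status[order_id]
        if new_status not in ALLOWED[current]:
            return False  # reject instead of letting callers overwrite state
        self._status[order_id] = new_status
        return True

    def status(self, order_id):
        return self._status[order_id]

orders = OrderService()
orders.place("order-1")
skipped = orders.request_transition("order-1", "SHIPPED")  # Warehouse jumps ahead
paid = orders.request_transition("order-1", "PAID")
```

Unlike a thin CRUD wrapper, the service can refuse an invalid request, which keeps the state logic in one place.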

Content coupling

  • Upstream service reaches into the internals of a downstream service and changes its internal state. E.g: changing downstream service's DB
  • Difference from common coupling: with Content Coupling the services are NOT aware that they share a common resource, while with Common Coupling they are

DDD

  • Ubiquitous Language:
  • Aggregate
  • Bounded Contexts

3 - Migrating from Monoliths

Decomposition Patterns

Strangler Fig Pattern

  • Wrapping an old system with the new system over time
flowchart TD
  ExtService -->|request| Interceptor
  Interceptor -->|calls not yet migrated| OldMonolith
  Interceptor -->|calls migrated functions| NewMicroservices
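
The interceptor in the diagram can be sketched as a routing function. The path prefixes are hypothetical; in practice this logic usually lives in a proxy or gateway configuration rather than application code.

```python
# Hypothetical list of functionality that has already been migrated.
MIGRATED_PREFIXES = ("/payments", "/invoices")

def route(path):
    """Decide which system should serve a request."""
    if path.startswith(MIGRATED_PREFIXES):
        return "NewMicroservices"
    return "OldMonolith"

migrated_target = route("/payments/42")
legacy_target = route("/orders/7")
```

Migrating another function then means adding a prefix, and rolling back means removing it.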

Parallel run

  • Running the same request against the old and new system for comparison
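
A minimal sketch of a parallel run, assuming the old implementation stays the source of truth while the new one is only observed. `old_price` and `new_price` are made-up implementations.

```python
def parallel_run(request, old_impl, new_impl, mismatches):
    """Call both implementations, record differences, return the old result."""
    old_result = old_impl(request)
    try:
        new_result = new_impl(request)
        if new_result != old_result:
            mismatches.append((request, old_result, new_result))
    except Exception as error:
        # A crash in the new code must not affect the caller.
        mismatches.append((request, old_result, error))
    return old_result

def old_price(quantity):
    return quantity * 10

def new_price(quantity):
    # Hypothetical new implementation with a bulk discount the old one lacks.
    return quantity * 9 if quantity >= 100 else quantity * 10

mismatches = []
small = parallel_run(5, old_price, new_price, mismatches)
bulk = parallel_run(100, old_price, new_price, mismatches)
```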

Feature toggle (feature flags)

Data Decomposition Concerns

Performance

  • Cannot join across tables any more; the data has to be fetched from other services
  • Increases overall latency

Data integrity

Transactions

Reporting database

4 - Communication Styles

Synchronous blocking

Advantages

  • Readable code

Disadvantages

  • Temporal Coupling
  • The response is lost if the upstream service instance dies
  • Performance is easily affected by downstream or network latency

Usage

  • Simple architectures: length of the call chain is short

Asynchronous non-blocking

Advantages

  • No Temporal Coupling
  • Good for long processing

Disadvantages

  • Complexity
  • Range of choice

Common Data communication

  • A microservice puts data into a defined location, and another microservice makes use of the data

Advantages

  • Simple

Disadvantages

  • Downstream services have to work in polling or periodically triggered jobs
  • Can not be used for low-latency situations
  • Introduce Common Coupling

Usage

  • When working with an old system where technology support is limited, or changes are expensive
  • Sharing large volumes of data

Request - response

Synchronous implementation

  • Upstream service opens network connection with downstream service
  • The connection is kept open until the downstream service responds
  • The downstream service does NOT need to know about the caller; it just sends the response back over the same connection

Asynchronous implementation

  • Downstream service needs to know where to route the response
  • The upstream service has to store the request states, so it knows how to handle the response
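
A sketch of the asynchronous variant, assuming a correlation ID is used to match responses to stored request state. The `reply_to` name and the message shape are hypothetical, and `outbox` stands in for whatever transport is used.

```python
import uuid

pending = {}  # correlation ID -> state of the request that is still waiting

def send_request(payload, outbox, reply_to="orders.replies"):
    """Store the request state, then hand the message to the transport."""
    correlation_id = str(uuid.uuid4())
    pending[correlation_id] = payload
    outbox.append({"correlation_id": correlation_id, "reply_to": reply_to, "body": payload})
    return correlation_id

def handle_response(message):
    """Match a response to its request; unknown or repeated responses give None."""
    original = pending.pop(message["correlation_id"], None)
    if original is None:
        return None
    return (original, message["body"])

outbox = []
cid = send_request({"order_id": "order-1"}, outbox)
response = {"correlation_id": cid, "body": "reserved"}
result = handle_response(response)
duplicate = handle_response(response)
```

In a real deployment `pending` would need to live in shared storage, since the response may arrive at a different instance than the one that sent the request.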

Notes

  • All forms of request - response should implement timeout mechanism
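
One way to apply a timeout, sketched with `asyncio.wait_for` from the standard library. `slow_downstream` is a hypothetical stand-in for a network call, and the timeout value is arbitrary.

```python
import asyncio

async def slow_downstream():
    """Hypothetical downstream call that takes too long."""
    await asyncio.sleep(1.0)
    return "stock reserved"

async def call_with_timeout(timeout_seconds):
    try:
        return await asyncio.wait_for(slow_downstream(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Give up instead of waiting forever; the caller decides what to do next.
        return "timed out"

result = asyncio.run(call_with_timeout(0.05))
```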

Usage

  • When result of a request is needed before further actions take place
  • When some sort of compensating action (e.g. a retry) should be carried out if the request fails

Event-Driven Communication

  • A microservice emits an event that may or may not be received by other microservices
  • Reduce coupling
flowchart LR
  Warehouse -.-> Order_Status_Topic
  Order_Status_Topic -.-> Notifications & Inventory

Implementation

  • Message brokers
  • Use HTTP. E.g: Atom

Event details

Just an ID events

  • Contains only the ID of the entity
  • E.g: notification service only needs to know the ID of new customer to send registration emails
  • Cons:
    • Adds Domain Coupling: if the notification service needs more customer details (e.g. email), it will need to call the Customer service
    • Emitting service may get lots of callback requests from downstream services

Fully detailed events

  • Pros:
    • Receiver doesn't need to know about other microservices
    • Good for auditing or reconstituting an entity
  • Cons:
    • Size limitation for message brokers
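
The two event styles side by side, as a rough sketch. The event type, field names and the `MAX_BYTES` limit are assumptions; real broker limits vary.

```python
import json

customer = {"id": "cust-42", "name": "Ada", "email": "ada@example.com"}

def id_only_event(entity):
    """Receivers must call back to the Customer service for anything else."""
    return {"type": "CustomerRegistered", "customer_id": entity["id"]}

def fully_detailed_event(entity):
    """Receivers get everything they need in the event itself."""
    return {"type": "CustomerRegistered", "customer": dict(entity)}

MAX_BYTES = 256  # hypothetical broker message size limit

def fits_broker(event):
    return len(json.dumps(event).encode("utf-8")) <= MAX_BYTES

small_event = id_only_event(customer)
full_event = fully_detailed_event(customer)
```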

Consideration

  • Complexity: handle request-response async
  • Ensure good monitoring and use correlation IDs

5 - Implementing Microservice Communication

Looking for ideal technology

Considerations

  • Backward compatibility
  • Make interface explicit
  • Make service simple for consumers
  • Hide internal implementation detail

RPC

  • Makes a call to a remote service look like a local call
  • Usually requires an explicit schema, as with SOAP or gRPC
  • Java RMI: doesn't need a separate service definition, but requires both client and server to use Java
  • Normally works with a serialization protocol. E.g: protobuf (gRPC)
  • Pros:
    • Ease of client code generation
  • Cons:
    • Technology coupling: some RPC mechanisms (like Java RMI) are tied to a specific platform
    • Performance: a remote call still pays for serialization, deserialization and connection time
      • Remote calls need to be treated differently from local calls, or developers may use them in the wrong way
    • Redundancy: unused fields can't be safely removed
  • Do not abstract the RPC implementation too much.

REST

  • Most important thing to know: resources
  • The server creates different representations of a Resource on request
  • There are different styles of REST. See Richardson Maturity Model
  • REST doesn't talk about underlying protocols, although it is mostly used over HTTP
  • HTTP has some defined capabilities to play well with REST
  • HATEOAS (Hypermedia as the engine of application state)
  • Challenges:
    • Generate client code. OpenAPI to the rescue
    • Performance: HTTP overhead, HATEOAS styled APIs...
  • Good parts:
    • Easily understood and integrated
    • Scales well, and requests can be cached effectively
    • Great for synchronous request-response

GraphQL

  • Clients can define exactly which pieces of information to query, without making multiple requests
  • Challenges
    • Dynamic queries can cause performance issues which are hard to trace
    • Client-side caching or CDN caching is harder
    • Not as well suited to writing data
  • Good for:
    • User GUI
    • External services
  • Alternatives: Backend for Frontend pattern

Message brokers

  • Tend to provide queues or topics or both
    • Queues: typically point to point
    • Topics: multiple subscribers can subscribe to one topic and all of them receive the message. The sender does not know about the receivers
  • Topics are good for event-based, while queues are good for request-response
  • Guaranteed delivery: The broker ensures the message is delivered
    • If downstream service is unavailable, broker holds the message
    • Less responsibility for upstream service
    • Have to trust the broker
  • Delivery order: not always guaranteed
  • Be sure to handle duplicate messages
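
A minimal sketch of an idempotent consumer, assuming every message carries a unique `message_id` (a hypothetical field). A real consumer would persist the processed IDs rather than keep them in memory.

```python
class InventoryConsumer:
    """Applies each message at most once, even if the broker redelivers it."""

    def __init__(self):
        self.processed_ids = set()
        self.stock = 10

    def handle(self, message):
        if message["message_id"] in self.processed_ids:
            return False  # duplicate delivery, ignore
        self.stock -= message["quantity"]
        self.processed_ids.add(message["message_id"])
        return True

consumer = InventoryConsumer()
message = {"message_id": "m-1", "quantity": 2}
first = consumer.handle(message)
second = consumer.handle(message)  # same message delivered again
```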

Serialization formats

Textual

  • Readable
  • JSON: de facto choice
    • Better size than XML
  • Avro: can send schema as part of the payload
    • Uses JSON as the underlying structure for its schemas

Binary

  • Choose when payload size matters
  • Or when the time to read/write the payload has to be reduced
  • Protocol buffers: de facto choice

Avoid breaking changes

  • Expansion changes
    • Add new things to a microservice interface, don't remove old things
  • Tolerant reader
    • Consumers should not break if the service sends more than they expect
  • Right technology
    • Pick technologies that make it easier to make backward-compatible changes to the interface
  • Explicit interface
    • Be explicit about what a microservice exposes
  • Catch accidental breaking changes early
    • Contract testing
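
The tolerant reader idea can be sketched as a consumer that reads only the field it needs and ignores the rest. The payloads and the `loyalty_tier` field are made up to show an expansion change.

```python
import json

def read_customer_email(payload):
    """Pull out only the field this consumer needs; ignore everything else."""
    document = json.loads(payload)
    return document["email"]

# The provider later adds a field (an expansion change).
payload_v1 = '{"id": 1, "email": "a@example.com"}'
payload_v2 = '{"id": 1, "email": "a@example.com", "loyalty_tier": "gold"}'

email_before = read_customer_email(payload_v1)
email_after = read_customer_email(payload_v2)
```

A reader that validated the payload against an exact list of fields would have broken on the second payload.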

Manage breaking changes

3 options

  • Lockstep deployment
    • Deploy the exposing microservices and consumers at the same time
  • Coexist
    • Run old and new versions of the microservice side by side
    • Good for canary release
    • Should only be done for a short period
  • Emulate the old interface
    • Expose the new interface and emulate the old ones
    • Needs a version in the API path
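
A sketch of emulating the old interface, assuming a v2 response that split `name` into two parts. The handlers, paths and data are hypothetical. The v1 handler is only an adapter over v2, so there is one implementation to maintain.

```python
def get_customer_v2(customer_id):
    """New interface: the name is structured."""
    return {"id": customer_id, "name": {"first": "Ada", "last": "Lovelace"}}

def get_customer_v1(customer_id):
    """Old interface, emulated by reshaping the v2 response."""
    new = get_customer_v2(customer_id)
    full_name = new["name"]["first"] + " " + new["name"]["last"]
    return {"id": new["id"], "name": full_name}

# Version in the API path selects the handler.
ROUTES = {
    "/v1/customers": get_customer_v1,
    "/v2/customers": get_customer_v2,
}

old_shape = ROUTES["/v1/customers"]("c1")
new_shape = ROUTES["/v2/customers"]("c1")
```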

Sharing code via libraries

  • Sharing code via libraries can create coupling and force otherwise unnecessary redeployments
  • Keep in mind that multiple versions of a library will coexist at the same time

Service Discovery

DNS

  • Using domains to point to IP address
  • DNS entries are cached (TTL), so clients can hold stale entries -> point the DNS entry at a load balancer
flowchart TD
  Client --> LoadBalancer
  LoadBalancer --> Inventory_Instance1 & Inventory_Instance2

Dynamic Service Registries

  • Service registers itself with central registry
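
A toy in-memory registry to illustrate self-registration. The heartbeat and TTL behavior is an assumption, not something these notes specify; time is passed in explicitly to keep the example deterministic.

```python
class ServiceRegistry:
    """In-memory sketch; entries expire unless the instance keeps re-registering."""

    def __init__(self, ttl_seconds=30.0):
        self._ttl = ttl_seconds
        self._last_seen = {}  # (service name, address) -> time of last registration

    def register(self, service, address, now):
        # Called on startup and again as a heartbeat.
        self._last_seen[(service, address)] = now

    def lookup(self, service, now):
        """Return addresses whose last registration is within the TTL."""
        return sorted(
            address
            for (name, address), seen in self._last_seen.items()
            if name == service and now - seen <= self._ttl
        )

registry = ServiceRegistry(ttl_seconds=30.0)
registry.register("inventory", "10.0.0.1:8080", now=0.0)
registry.register("inventory", "10.0.0.2:8080", now=20.0)
live = registry.lookup("inventory", now=40.0)  # first instance has expired
```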

API Gateways

  • Focus on north-south traffic
  • Mapping requests from external parties to internal services
  • Act as a reverse proxy
  • May also implement features like API keys, logging, rate limiting, external customer portals
  • Bad use cases:
    • Call aggregation
    • Protocol rewriting
    • Intermediary for east-west calls (internal services call)

Service meshes

  • Common functionalities are pushed into the mesh
  • Including: timeout, mTLS, correlation IDs, service discovery, load balancing...
flowchart TD
  Service_Mesh_CP -.-> Mesh_Proxy_1 & Mesh_Proxy_2
  subgraph Local_Machine1
  Order_Processor --> Mesh_Proxy_1
  end
  subgraph Local_Machine2
  Mesh_Proxy_2 --> Payment
  end
  Mesh_Proxy_1 -->|Remote Network Call| Mesh_Proxy_2

6 - Workflow

Database Transactions

  • ACID Transactions

Distributed Transactions - 2 Phase Commits (2PC)

  • Voting phase: central coordinator confirms with all workers to see if changes can be made
  • Commit phase
flowchart TD
  Coordinator -->|1.Update row 1?| WorkerA
  WorkerA -->|2. Yes| Coordinator
  Coordinator -->|1. Delete row 2?| WorkerB
  WorkerB -->|2. Yes| Coordinator
  subgraph Customer
  WorkerA --> CustomerDB
  end
  subgraph Enrollments
  WorkerB --> EnrollmentsDB
  end
  
  • Usually implemented by locking
  • The more participants, the higher the latency and the more things can go wrong
  • JUST SAY NO to Distributed Transactions
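
For reference, the two phases can be sketched as below. This is a simplified model that ignores the hard parts (coordinator crashes, lost messages, lock timeouts); the `Worker` class is hypothetical.

```python
class Worker:
    """Hypothetical participant; a real one would take locks in prepare()."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(workers):
    # Voting phase: every worker is asked whether it can make its change.
    votes = [worker.prepare() for worker in workers]
    # Commit phase: commit only if all voted yes, otherwise abort everywhere.
    if all(votes):
        for worker in workers:
            worker.commit()
        return True
    for worker in workers:
        worker.abort()
    return False

happy = [Worker("A", True), Worker("B", True)]
unhappy = [Worker("A", True), Worker("B", False)]
committed = two_phase_commit(happy)
rejected = two_phase_commit(unhappy)
```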

Sagas

  • Recover from business failure (Insufficient Funds), not technical failures (Internal Error)
  • Failure modes:
    • Backward recovery: define compensating actions to rollback committed transactions
    • Forward recovery: pick up from the point of failure and keep processing
  • Compensating transaction: an operation that undoes a committed transaction
    • Does not work like a database rollback
    • This is a semantic rollback, as we can't cleanly revert a transaction
    • The rollback itself is information worth persisting in the system
  • Reordering workflow steps to reduce rollbacks
  • Orchestrated sagas: using a central coordinator (or orchestrator)
    • Higher domain coupling
    • The orchestrator can absorb logic that belongs in the downstream services
    • Good when one team owns implementation of the entire saga
---
title: Orchestrated sagas
---
flowchart LR
  Orchestrator -->|1. Reserve Stock| Warehouse
  Orchestrator -->|2. Take payment| Payment
  Orchestrator -->|3. Award points| Loyalty
  Orchestrator -->|4. Send package| Warehouse
  • Choreographed sagas
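
A minimal sketch of an orchestrated saga with backward recovery, assuming each step comes with a compensating action. `BusinessFailure` and the step functions are hypothetical, and the steps are plain local functions rather than calls to other services.

```python
class BusinessFailure(Exception):
    """Hypothetical error type for business failures such as insufficient funds."""

def run_saga(steps):
    """Run (name, action, compensate) steps in order; compensate in reverse on failure."""
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except BusinessFailure:
            # Backward recovery: undo already committed steps, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append("compensate: " + done_name)
            return False, log
        completed.append((name, compensate))
        log.append(name)
    return True, log

stock = {"reserved": 0}

def reserve_stock():
    stock["reserved"] += 1

def release_stock():
    stock["reserved"] -= 1

def take_payment():
    raise BusinessFailure("insufficient funds")

def refund_payment():
    pass  # never called in this run, because the payment was never taken

ok, log = run_saga([
    ("reserve stock", reserve_stock, release_stock),
    ("take payment", take_payment, refund_payment),
])
```

Only `BusinessFailure` triggers compensation here; technical errors propagate, in line with the note that sagas target business failures.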

7 - Build

Currently omitted for more important parts

CI

Repository model

8 - Deployment

Currently omitted for more important parts

9 - Testing

Currently omitted for more important parts

10 - Monitoring to Observability

Building blocks for Observability

Log aggregation:

  • Prerequisite for building a microservice
  • Collects logs from each machine and makes them available centrally
  • Processes write their logs to the local filesystem
  • A local daemon process periodically collects these logs and forwards them to some sort of store
flowchart LR
  subgraph Host
  MicroserviceInstance -->|Log to local filesystem| Logs
  Logs --> Log-forwarding_Daemon
  end
  Log-forwarding_Daemon -->|Periodically| Log_Aggregation_Tool
  Operators --> Log_Aggregation_Tool

Common Format

  • Date, time, microservice name, log level... in CONSISTENT places in each log
  • Log formatting should be done by the microservice, not by the log forwarder (to avoid performance issues)
  • Must have correlation IDs
  • Timestamps are not guaranteed to be in the correct order, because clocks differ between microservices
    • Use a logical clock or distributed tracing to address the issue
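
One possible way to get a consistent format with the standard `logging` module: a formatter that emits JSON with fixed keys. The key names and the use of `extra` to carry the correlation ID are choices made for this sketch.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit every record as JSON with the same keys in the same places."""

    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "service": record.name,
            "level": record.levelname,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })

stream = io.StringIO()  # stands in for the local log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The correlation ID would normally come from the incoming request.
logger.info("order placed", extra={"correlation_id": "abc-123"})
entry = json.loads(stream.getvalue())
```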

Concerns:

  • Use Elasticsearch with caution if you need to be sure that logs are never lost
    • There should be a re-index mechanism
  • Sizes of logs, performance, scalability
    • Pick the right log aggregator
  • Restrict access to logs
    • Do not log certain types of information

Metrics aggregation

  • Collect system metrics for prediction on system health, scaling, capacity planning...
  • These metrics could be stored and reported at different resolutions (e.g: sampling rates...)

Cardinality

  • The number of fields that can be easily queried in a given data point
  • The more potential fields we want to query in our data, the higher the cardinality we need to support
  • Bear in mind whether the system can support high-cardinality data

Tools

  • Prometheus, Graphite
  • Honeycomb, Lightstep

Distributed Tracing

Implementation

  • Capture span information (supported by standard APIs such as OpenTracing or OpenTelemetry)
  • Send span information to collector
    • Either send directly from service or using a forwarding agent.
    • Using an agent allows more advanced capabilities like changing sampling or adding tags, or buffer information
  • Have a collector receive this information
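
To make the span idea concrete, here is a hand-rolled sketch. It does not use the OpenTelemetry API; the `span` helper and the `collected` list are stand-ins for real instrumentation and a real collector.

```python
import time
import uuid
from contextlib import contextmanager

collected = []  # stand-in for the collector

@contextmanager
def span(name, trace_id, parent_id=None):
    """Record one unit of work and report it when the work finishes."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        collected.append({
            "trace_id": trace_id,
            "span_id": span_id,
            "parent_id": parent_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })

trace_id = uuid.uuid4().hex
with span("place-order", trace_id) as root_id:
    # The child span carries the parent's ID so the collector can rebuild the tree.
    with span("reserve-stock", trace_id, parent_id=root_id):
        pass
```

Spans are reported as they finish, so the child reaches the collector before its parent.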

Tools

  • Jaeger (Open source)
  • Honeycomb, Lightstep
  • Whatever is chosen should support the OpenTelemetry API

Are we doing OK?

  • How well are we or the systems doing
  • SLA (Service Level Agreement)
    • An agreement between people building and using the system
    • Describe user expectation and what happens if this level of behavior is not reached
  • SLOs (Service Level Objectives)
    • Define what the team signs up to provide
    • Also describes goals not described in SLAs
  • SLI (Service Level Indicators)
    • Measures of something our software does. E.g: response time of a process...
  • Error Budgets
    • How much error is acceptable in a system
    • Encouraging trying of new things or changes
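
As a worked example, an error budget can be derived from an availability SLO. This assumes the SLO is expressed as a percentage of time over a fixed window; the numbers are illustrative.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of allowed downtime in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100

# A 99.9% SLO over 30 days leaves roughly 43.2 minutes of budget.
budget = error_budget_minutes(99.9, 30)
```

While budget remains, the team has room to try changes; once it is spent, stability work takes priority.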

Alerting

Testing in production

  • Synthetic transactions: injecting fake user behavior into the production system
    • Using end-to-end tests of a service or even the whole system
    • Take care not to affect real user data or cause side effects
  • A/B testing: deploying 2 different versions of the same functionality
    • Deciding on how something should be done
  • Canary release: enable the new functionalities for a small number of users
    • Revert the changes when needed
  • Parallel run: execute 2 different implementations of the same function side by side
    • Results can be compared
  • Smoke tests: simple tests or synthetic transactions run after the system is deployed, but before it is released
  • Chaos engineering: injecting faults into the production system to see if it can handle these expected issues

11 - Security

12 - Resiliency

Concepts

  • Robustness: ability to absorb expected perturbation
    • Things like network/hard drive failure, service unavailable...
    • Requires prior knowledge of failures
    • Can introduce a new layer of complexity and potentially sources of new issues
  • Rebound: ability to recover after a traumatic event
    • Requires preparing for these situations in advance
  • Graceful extensibility: how well we deal with a situation that is unexpected
  • Sustained adaptability: ability to continually adapt to changing environments, stakeholders and demands
Failures are likely to happen, so we must prepare in advance: not only to prevent them, but also to handle them when they occur

Defining limits

  • Response time/latency: how long operations should take for a given number of concurrent users
  • Availability: acceptable downtime
  • Durability of data: how much data loss is acceptable, how long should data be kept for
  • Should be defined together with SLA/SLOs