[Draft] Microservices - vinhtbkit/bkit-kb GitHub Wiki

1 - Intro

Key concepts

Independent deployability
Modeled around a business domain
Owning their own states
Size

A microservice should be as big as your head: small enough that one person can understand it

Flexibility: keeps options open for solving the problems we might face: organizational, technical, scale, robustness...
Architecture and organization alignment

Advantages

Technology heterogeneity: using different techs inside each microservice -> pick the right tools for the job
Robustness:
- Isolate the problematic service
Scaling
Ease of deployment:
- Minor changes should not require redeployment of whole system
Organizational alignment
Compose-ability

Pain points

Dev experience
- Local developer resources
Technology overhead
- Different technologies in each service
- New terms introduced
Cost
- Resources: CPU /mem, licences,...
- Learning curves
- Not suitable for cost-reducing organizations
Reporting
- Difficult to gather information
- Has to adopt new techniques
Monitoring and troubleshooting
- Tracking the problematic service
Security
- Information flows over network -> target of MitM attacks
Testing
Latency
Data consistency

Not for

Brand new products or startups
- Things are going to change quickly -> changes to service boundaries
Organizations creating software that will be deployed and managed by their customers (don't have IT backgrounds)

Good for

Big teams
SaaS applications: 24/7 operations, rolling changes are required
Orgs looking to provide services to customers over a variety of new channels

2 - Modeling Microservices

Good service boundaries

Information hiding
- By keeping the number of assumptions small, it is easier to ensure that we can change one module without impacting others

The connections between modules are the assumptions which the modules make about each other.

Cohesion

The code that changes together, stays together

Coupling: services should be loosely coupled
- Loosely coupled service: it should know as little as it needs to about the services with which it collaborates

A structure is stable if cohesion is strong and coupling is low

There is no absolute best way to organize our code. We have to strike the right balance between coupling and cohesion

Types of coupling

Domain coupling

When one service needs functionality of another service

flowchart LR
  Order -->|reserve| Warehouse
  Order -->|take payment| Payment

This should be kept to a minimum
Information hiding: Share only what you absolutely have to, and send only the absolute minimum amount of data you need

Temporal coupling

One microservice needs another microservice at the same time for the operation to be completed
Difficult to scale -> use some form of asynchronous communication, such as a message broker

flowchart LR
  OrderProcessor -->|Sync blocking network call| Warehouse

Pass-through coupling

Passing data to another service purely because this data is needed by some other downstream microservices

flowchart LR
  OrderProcessor -->|Shipping manifest| Warehouse
  Warehouse -->|Shipping manifest| Shipping

A change to data downstream can cause a significant upstream change
Potential solutions
- Bypass the intermediary -> Talking directly to the downstream service
  - Increase domain coupling (Order needs to know Shipping)
  - Increase logic complexity for Order, which was previously hidden by Warehouse (stock has to be reserved in Warehouse before shipping - Order service now manages this)
- Hide downstream details: Order service takes shipping details as its contract, but the Warehouse takes care of building the shipping Manifest

---
title: Bypassing intermediary
---
flowchart LR
  Order -->|1 - reserve stock| Warehouse
  Order -->|2 - shipping Manifest| Shipping
  Order -->|3 - remove stock| Warehouse

---
title: Hiding details
---
flowchart LR
  Order -->|order request| Warehouse
  Warehouse -->|dispatch shipping Manifest| Shipping
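
The "hide downstream details" option could look roughly like the sketch below. It assumes Order sends only the data it owns and Warehouse translates that into the manifest Shipping expects. The names `OrderRequest`, `ShippingManifest`, `build_manifest` and the `items_per_parcel` rule are hypothetical and only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderRequest:
    """What Order sends: only data it owns (hypothetical contract)."""
    order_id: str
    sku: str
    quantity: int
    recipient: str
    address: str


@dataclass(frozen=True)
class ShippingManifest:
    """Format expected by Shipping; only Warehouse knows how to build it."""
    order_id: str
    label: str
    parcels: int


def build_manifest(request: OrderRequest, items_per_parcel: int = 5) -> ShippingManifest:
    """Warehouse-side translation from the order request to the manifest."""
    parcels = -(-request.quantity // items_per_parcel)  # ceiling division
    return ShippingManifest(
        order_id=request.order_id,
        label=f"{request.recipient}, {request.address}",
        parcels=parcels,
    )
```

If Shipping changes the manifest format, only `build_manifest` in Warehouse has to change; the contract that Order depends on stays the same.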

Common coupling

Making use of the same shared resources: database or filesystem

flowchart LR
  Order -->|Set PLACED, PAID, COMPLETED status| OrderTable
  Warehouse -->|Set PICKING, SHIPPED status| OrderTable

Common coupling is sometimes OK, but often not
Potential solutions:
Ensure that a single microservice manages the order state
Or create a microservice that is little more than a thin wrapper for CRUD operations on the database
- A thin wrapper is a sign of weak cohesion and tighter coupling, since the logic to manage the data is spread elsewhere in the system instead of living in this service
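
A minimal sketch of the first option, where one service owns the order state and other services can only request a change. The statuses come from the diagram above; the allowed transition order and the names `OrderService` and `request_status_change` are assumptions made for this example.

```python
# Allowed transitions are an assumption for illustration only.
ALLOWED = {
    "PLACED": {"PAID"},
    "PAID": {"PICKING"},
    "PICKING": {"SHIPPED"},
    "SHIPPED": {"COMPLETED"},
    "COMPLETED": set(),
}


class InvalidTransition(Exception):
    """Raised when a requested status change is not allowed."""


class OrderService:
    """The single owner of order state; nobody else writes the table."""

    def __init__(self) -> None:
        self._status = {}

    def place(self, order_id: str) -> None:
        self._status[order_id] = "PLACED"

    def request_status_change(self, order_id: str, new_status: str) -> None:
        # Other services (e.g. Warehouse) send a request; Order decides.
        current = self._status[order_id]
        if new_status not in ALLOWED[current]:
            raise InvalidTransition(f"{current} -> {new_status}")
        self._status[order_id] = new_status

    def status(self, order_id: str) -> str:
        return self._status[order_id]
```

Because the rules live in one place, this is more than a CRUD wrapper: the service can reject a change that would leave the order in an invalid state.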

Content coupling

Upstream service reaches into the internals of a downstream service and changes its internal state. E.g: changing downstream service's DB
Difference from common coupling: with content coupling the services are NOT aware that they share a common resource, while with common coupling they are

DDD

Ubiquitous Language:
Aggregate
Bounded Contexts

3 - Migrating from Monoliths

Decomposition Patterns

Strangler Fig Pattern

Wrapping an old system with the new system over time

flowchart TD
  ExtService -->|request| Interceptor
  Interceptor -->|calls not yet migrated| OldMonolith
  Interceptor -->|calls migrated functions| NewMicroservices
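
The interceptor in the diagram can be sketched as a small router. This is only an illustration: routing by path prefix and the names `Interceptor`, `migrate` and `handle` are assumptions, and in practice this role is usually played by a proxy in front of both systems.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]


class Interceptor:
    """Routes each call to the new service if migrated, else to the monolith."""

    def __init__(self, monolith: Handler) -> None:
        self._monolith = monolith
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, path_prefix: str, handler: Handler) -> None:
        # Called once a piece of functionality has moved to a microservice.
        self._migrated[path_prefix] = handler

    def handle(self, path: str, payload: dict) -> str:
        for prefix, handler in self._migrated.items():
            if path.startswith(prefix):
                return handler(payload)
        return self._monolith(payload)  # not yet migrated
```

Migrating another function is then a matter of registering one more prefix, and rolling back is removing it again.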

Parallel run

Running the same request against the old and new system for comparison
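
A minimal sketch of a parallel run, assuming the old implementation stays the source of truth while the new one is only observed. The function name `parallel_run` and the shape of the mismatch records are hypothetical.

```python
def parallel_run(old_impl, new_impl, request, mismatches):
    """Serve the old result; call the new implementation only to compare."""
    trusted = old_impl(request)
    try:
        candidate = new_impl(request)
        if candidate != trusted:
            mismatches.append({"request": request, "old": trusted, "new": candidate})
    except Exception as exc:
        # A crash in the new system must never affect the caller.
        mismatches.append({"request": request, "old": trusted, "error": repr(exc)})
    return trusted
```

The caller always gets the old system's answer, so the new implementation can be wrong or slow without any user-visible effect while the mismatches are reviewed.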

Feature toggle (feature flags)

Data Decomposition Concerns

Performance

Cannot join across tables; the data has to be fetched from other services
Increases overall latency

Data integrity

Transactions

Reporting database

4 - Communication Styles

Synchronous blocking

Advantages

Readable code

Disadvantages

Temporal Coupling
The response gets lost if the upstream service instance dies
Downstream or network latency easily affects performance

Usage

Simple architectures: length of the call chain is short

Asynchronous non-blocking

Advantages

No Temporal Coupling
Good for long processing

Disadvantages

Complexity
Range of choice

Common Data communication

A microservice puts data into a defined location, and another microservice makes use of the data

Advantages

Simple

Disadvantages

Downstream services have to work in polling or periodically triggered jobs
Can not be used for low-latency situations
Introduce Common Coupling

Usage

When working with an old system where technology support is limited, or changes are expensive
Sharing large volumes of data

Request - response

Synchronous implementation

Upstream service opens network connection with downstream service
The connection is kept open until the downstream service responds
Downstream does NOT need to know about the caller, it just sends the response back

Asynchronous implementation

Downstream service needs to know where to route the response
The upstream service has to store the request states, so it knows how to handle the response

Notes

All forms of request - response should implement timeout mechanism
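
For the asynchronous implementation, the stored request state and the timeout can be sketched together. This is an in-memory illustration only: the class `PendingRequests`, its methods, and the idea of passing `now` in explicitly (to keep the example deterministic) are assumptions.

```python
import uuid


class PendingRequests:
    """Upstream-side state for asynchronous request-response."""

    def __init__(self, timeout_s: float) -> None:
        self._timeout_s = timeout_s
        self._pending = {}  # correlation_id -> (context, deadline)

    def register(self, context: dict, now: float) -> str:
        """Store what we need to handle the response later; return its ID."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = (context, now + self._timeout_s)
        return correlation_id

    def on_response(self, correlation_id: str, now: float):
        """Return the stored context, or None if unknown or too late."""
        entry = self._pending.pop(correlation_id, None)
        if entry is None:
            return None
        context, deadline = entry
        return context if now <= deadline else None

    def expire(self, now: float) -> list:
        """Drop timed-out requests and return their contexts for follow-up."""
        expired = [cid for cid, (_, d) in self._pending.items() if d < now]
        return [self._pending.pop(cid)[0] for cid in expired]
```

The correlation ID travels with the request and comes back with the response, which is how the upstream service finds the matching state. Contexts returned by `expire` are the candidates for a retry or a compensating action.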

Usage

When result of a request is needed before further actions take place
When some sort of compensating action (e.g. retry) should be carried out if the request didn't work

Event-Driven Communication

A microservice emits an event that may or may not be received by other microservices
Reduce coupling

flowchart LR
  Warehouse -.-> Order_Status_Topic
  Order_Status_Topic -.-> Notifications & Inventory

Implementation

Message brokers
Use HTTP. E.g: Atom

Event details

Just-an-ID events

Contains only the ID of the entity
E.g: notification service only needs to know the ID of new customer to send registration emails
Cons:
- Add Domain Coupling: notification service needs more customer details (email) -> will need to call Customer service
- Emitting service may get lots of callback requests from downstream services

Fully detailed events

Pros:
- Receiver doesn't need to know about other microservices
- Good for auditing or reconstituting an entity
Cons:
- Size limitation for message brokers

Consideration

Complexity: handle request-response async
Ensure good monitoring and use correlation IDs

5 - Implementing Microservice Communication

Looking for ideal technology

Considerations

Backward compatibility
Make interface explicit
Make service simple for consumers
Hide internal implementation detail

RPC

Making a call to a remote service look like a local call
Usually requires an explicit schema: e.g. SOAP or gRPC
Java RMI: doesn't need a separate service definition, but both client and server must use Java
Normally works with a serialization protocol. E.g: protobuf (gRPC)
Pros:
- Ease of client code generation
Cons:
- Technology coupling: some RPC mechanisms (like Java RMI) are tied to a specific platform
- Performance: a remote call still needs serialization, deserialization and connection time, unlike a local call
  - Developers need to be aware of this, or they could use it the wrong way
- Redundancy: unused fields can't be safely removed
Do not abstract the RPC implementation too much.

REST

Most important thing to know: resources
The server creates different representations of a Resource on request
There are different styles of REST. See Richardson Maturity Model
REST doesn't talk about underlying protocols, although it is mostly used over HTTP
HTTP has some defined capabilities to play well with REST
HATEOAS (Hypermedia as the engine of application state)
Challenges:
- Generate client code. OpenAPI to the rescue
- Performance: HTTP overhead, HATEOAS styled APIs...
Good parts:
- Easily understood and integrated
- Works at large scale, with effective caching of requests
- Great for synchronous request-response

GraphQL

Clients can define exactly which pieces of information to query, without making multiple requests
Challenges
- Dynamic queries can cause performance issues which are hard to trace
- Client-side caching or CDN caching is harder
- Not as good for writing data
Good for:
- User GUI
- External services
Alternatives: Backend for Frontend pattern

Message brokers

Tend to provide queues or topics or both
- Queues: typically point to point
- Topics: multiple subscribers can subscribe to one topic and all of them receive the message. The sender does not know who the receivers are
Topics are good for event-based, while queues are good for request-response
Guaranteed delivery: The broker ensures the message is delivered
- If downstream service is unavailable, broker holds the message
- Less responsibility for upstream service
- Have to trust the broker
Delivery order: not always guaranteed
Be sure to handle duplicated messages
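
One common way to handle duplicates is an idempotent consumer that remembers which message IDs it has already processed. The sketch below assumes every message carries a unique `message_id` and keeps the seen IDs in memory; a real service would persist them.

```python
class IdempotentConsumer:
    """Processes each message ID at most once."""

    def __init__(self, handler) -> None:
        self._handler = handler
        self._seen = set()  # in-memory only; an assumption for this sketch

    def on_message(self, message_id: str, payload: dict) -> bool:
        """Return True if the message was processed, False if skipped."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeded, so a failed
        # message can be redelivered and retried.
        self._seen.add(message_id)
        return True
```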

Serialization formats

Textual

Readable
JSON: de facto choice
- Better size than XML
Avro: can send schema as part of the payload
- Uses JSON as the underlying format for its schema definitions

Binary

Choose when payload size matters
Or when the time to read / write the payload has to be reduced
Protocol buffers: de facto choice

Avoid breaking changes

Expansion changes
- Add new things to a microservice interface, don't remove old things
Tolerant reader
- Consumers should not break if a microservice sends more than they expect
Right technology
- Pick technologies that make it easier to make backward compatible changes to the interface
Explicit interface
- Be explicit about what a microservice exposes
Catch accidental breaking changes early
- Contract testing
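
A tolerant reader can be sketched as a consumer that picks out only the fields it needs and ignores everything else. The payload shape and the function name `read_customer` are assumptions for illustration.

```python
import json


def read_customer(raw: str) -> dict:
    """Extract only the fields this consumer uses; ignore the rest."""
    data = json.loads(raw)
    return {
        "id": data["id"],            # required by this consumer
        "email": data.get("email"),  # optional: tolerate its absence
    }
```

An expansion change on the producer side, such as adding a new field, does not break this reader because unknown fields are never touched.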

Manage breaking changes

3 options

Lockstep deployment
- Deploy the exposing microservices and consumers at the same time
Coexist
- Run old and new versions of the microservice side by side
- Good for canary release
- Should only be done for a short time
Emulate the old interface
- Expose the new interface and keep emulating the old ones
- Needs a version in the API path

Sharing code via libraries

Sharing libraries can create coupling and force unnecessary redeployments
Have to keep in mind that multiple versions coexist at the same time

Service Discovery

DNS

Using domains to point to IP address
DNS entries are cached (TTL), so clients can hold stale entries -> point the DNS entry at a load balancer

flowchart TD
  Client --> LoadBalancer
  LoadBalancer --> Inventory_Instance1 & Inventory_Instance2

Dynamic Service Registries

Service registers itself with central registry
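
A minimal in-memory sketch of a dynamic registry. The expiry of entries after a TTL unless the instance registers again (a heartbeat) is an assumption added for this example, as are the names `ServiceRegistry`, `register` and `lookup`.

```python
class ServiceRegistry:
    """Instances register themselves; clients look up live addresses."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._expires_at = {}  # (service, address) -> expiry time

    def register(self, service: str, address: str, now: float) -> None:
        # Registering again acts as a heartbeat and extends the entry.
        self._expires_at[(service, address)] = now + self._ttl_s

    def lookup(self, service: str, now: float) -> list:
        """Return addresses of instances whose entry has not expired."""
        return sorted(
            address
            for (name, address), expiry in self._expires_at.items()
            if name == service and expiry > now
        )
```

Unlike a cached DNS entry, an instance that stops sending heartbeats drops out of lookups once its entry expires.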

API Gateways

Focus on north-south traffic
Mapping requests from external parties to internal services
Acts as a reverse proxy
May also implement features like API keys, logging, rate limiting, external customer portals
Bad use cases:
- Call aggregation
- Protocol rewriting
- Intermediary for east-west calls (internal services call)

Service meshes

Common functionalities are pushed into the mesh
Including: timeout, mTLS, correlation IDs, service discovery, load balancing...

flowchart TD
  Service_Mesh_CP -.-> Mesh_Proxy_1 & Mesh_Proxy_2
  subgraph Local_Machine1
  Order_Processor --> Mesh_Proxy_1
  end
  subgraph Local_Machine2
  Mesh_Proxy_2 --> Payment
  end
  Mesh_Proxy_1 -->|Remote Network Call| Mesh_Proxy_2

6 - Workflow

Database Transactions

ACID Transactions

Distributed Transactions - 2 Phase Commits (2PC)

Voting phase: central coordinator confirms with all workers to see if changes can be made
Commit phase

flowchart TD
  Coordinator -->|1.Update row 1?| WorkerA
  WorkerA -->|2. Yes| Coordinator
  Coordinator -->|1. Delete row 2?| WorkerB
  WorkerB -->|2. Yes| Coordinator
  subgraph Customer
  WorkerA --> CustomerDB
  end
  subgraph Enrollments
  WorkerB --> EnrollmentsDB
  end

Usually implemented by locking
The more participants, the higher the latency of the system, and the more issues
JUST SAY NO to Distributed Transactions
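
To make the two phases concrete, here is a toy in-process sketch. It ignores the hard parts (coordinator crashes, lost messages, lock timeouts), which are exactly why the advice above is to avoid distributed transactions. The `Worker` class and its state names are hypothetical.

```python
class Worker:
    """A participant that can vote, then commit or roll back."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Voting phase: a "yes" means the worker holds locks until told otherwise.
        self.state = "locked" if self.can_commit else "refused"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "rolled_back"


def two_phase_commit(workers) -> bool:
    voted_yes = []
    for worker in workers:  # phase 1: voting
        if worker.prepare():
            voted_yes.append(worker)
        else:
            for locked in voted_yes:
                locked.rollback()
            return False
    for worker in workers:  # phase 2: commit
        worker.commit()
    return True
```

Every worker that voted yes stays locked until the coordinator reaches it in phase 2, which is where the latency and contention come from as participants are added.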

Sagas

Recover from business failure (Insufficient Funds), not technical failures (Internal Error)
Failure modes:
- Backward recovery: define compensating actions to rollback committed transactions
- Forward recovery: pick up from the point of failure and keep processing
Compensating transaction: an operation that undoes a committed transaction
- Does not work like a database rollback
- This is a semantic rollback, as we can't cleanly revert a committed transaction
- It is appropriate to persist the rollback information in the system
Reordering workflow steps to reduce rollbacks
Orchestrated sagas: using a central coordinator (or orchestrator)
- Higher domain coupling
- The orchestrator can absorb logic that belongs in downstream services
- Good when one team owns implementation of the entire saga

---
title: Orchestrated sagas
---
flowchart LR
  Orchestrator -->|1. Reserve Stock| Warehouse
  Orchestrator -->|2. Take payment| Payment
  Orchestrator -->|3. Award points| Loyalty
  Orchestrator -->|4. Send package| Warehouse
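
An orchestrated saga with backward recovery can be sketched as a loop that records each completed step and, on failure, runs the compensating actions in reverse order. The function `run_saga` and the `(name, action, compensation)` tuple shape are assumptions; real steps would be calls to other services.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (name, action, compensation) tuples, where action and
    compensation are zero-argument callables.
    """
    completed = []
    log = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()  # semantic rollback, e.g. release stock, refund
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log
```

Ordering the steps so that the most failure-prone one (here, taking payment) runs early keeps the number of compensations small.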

Choreographed sagas

7 - Build

Currently omitted for more important parts

CI

Repository model

8 - Deployment

Currently omitted for more important parts

9 - Testing

Currently omitted for more important parts

10 - Monitoring to Observability

Building blocks for Observability

Log aggregation:

Prerequisite for building a microservice
Collecting logs from each machine and making them available centrally
Each process writes its logs to the local filesystem
A local daemon process periodically collects these logs and forwards them to some sort of store

flowchart LR
  subgraph Host
  MicroserviceInstance -->|Log to local filesystem| Logs
  Logs --> Log-forwarding_Daemon
  end
  Log-forwarding_Daemon -->|Periodically| Log_Aggregation_Tool
  Operators --> Log_Aggregation_Tool

Common Format

Date, time, microservice name, log level... in CONSISTENT places in each log
Log formatting should be done by the microservice, not by the log forwarder (to avoid performance issues)
Must have correlation IDs
Timing not really guaranteed to be in correct order due to differences of time in each microservice
- Using logical clock or distributed tracing to address the issue
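
A common format with a correlation ID can be sketched with the standard `logging` module. The JSON layout, the field names, and the helper names `JsonFormatter` and `build_logger` are assumptions; the correlation ID is passed through `logging`'s `extra` argument.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emits every record with the same fields in the same places."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self._service = service

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "service": self._service,
            "level": record.levelname,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        })


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    return logger
```

A call such as `logger.info("order placed", extra={"correlation_id": "abc-123"})` then produces one JSON line that the log aggregation tool can filter by `correlation_id` across services.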

Concerns:

Use Elasticsearch cautiously if you want to make sure your logs are not lost
- There should be a re-index mechanism
Sizes of logs, performance, scalability
- Pick the right log aggregator
Restrict access to logs
- Do not log certain types of information

Metrics aggregation

Collect system metrics for prediction on system health, scaling, capacity planning...
These metrics could be stored and reported at different resolutions (e.g: sampling rates...)

Cardinality

The number of fields that can be easily queried in a given data point
The more potential fields we want to query in our data, the higher the cardinality we need to support
Bear in mind whether the system can support high-cardinality data

Tools

Prometheus, Graphite
Honeycomb, Lightstep

Distributed Tracing

Implementation

Capture span information (supported by standard APIs like OpenTracing or OpenTelemetry)
Send span information to collector
- Either send directly from service or using a forwarding agent.
- Using an agent allows more advanced capabilities like changing sampling or adding tags, or buffer information
Have a collector receive this information

Tools

Jaeger (Open source)
Honeycomb, Lightstep
Preferably something that supports the OpenTelemetry API

Are we doing OK?

How well are we or the systems doing
SLA (Service Level Agreement)
- An agreement between people building and using the system
- Describes user expectations and what happens if this level of behavior is not reached
SLOs (Service Level Objectives)
- Define what the team signs up to provide
- Also describes goals not described in SLAs
SLI (Service Level Indicators)
- Measures of something our software does. E.g: response time of a process...
Error Budgets
- How much error is acceptable in a system
- Encourages trying new things or changes
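
The error budget follows directly from an availability SLO. The sketch below assumes the SLO is expressed as a percentage of uptime over a window measured in days; the function names are hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of downtime allowed by an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, window_days: int, downtime_minutes: float) -> float:
    """Positive: room left for risky changes. Negative: budget is spent."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes
```

For example, a 99.9% SLO over 30 days allows roughly 43 minutes of downtime. While budget remains, the team can afford to try new things; once it is spent, the focus shifts to stability.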

Alerting

Testing in production

Synthetic transaction: injecting fake user behavior into the production system
- Using end-to-end tests of service or even whole system
- Take care of user data or side effects
A/B testing: deploying 2 different versions of the same functionality
- Deciding on how something should be done
Canary release: enable the new functionalities for a small number of users
- Revert the changes when needed
Parallel run: execute 2 different implementations of the same function side by side
- Results can be compared
Smoke tests: simple tests or synthetic transactions run after the system is deployed, but before it is released
Chaos engineering: injecting faults into the production system to see if it can handle these expected issues
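
For the canary release, one simple way to pick "a small number of users" is to hash the user ID into a stable bucket. This is only a sketch: the function `in_canary` and the 100-bucket scheme are assumptions, and many teams do this in a router or a feature-flag tool instead.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

The same user always lands in the same bucket, so they see a consistent version, and setting `percent` back to 0 reverts everyone to the old functionality.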

11 - Security

12 - Resiliency

Concepts

Robustness: ability to absorb expected perturbation
- Things like network/hard drive failure, service unavailable...
- Requires prior knowledge of failures
- Can introduce a new layer of complexity and potential sources of new issues
Rebound: ability to recover after a traumatic event
- Requires preparing for these situations in advance
Graceful extensibility: how well we deal with a situation that is unexpected
Sustained adaptability: ability to continually adapt to changing environments, stakeholders and demands

Failures are likely to happen, so we must prepare in advance: not only to prevent them, but also to handle them when they occur

Defining limits

Response time/latency: how long operations should take with a given number of concurrent users
Availability: acceptable downtime
Durability of data: how much data loss is acceptable, how long should data be kept for
Should be defined together with SLA/SLOs
