[Draft] Microservices - vinhtbkit/bkit-kb GitHub Wiki
1 - Intro
Key concepts
- Independent deployability
- Modeled around a business domain
- Owning their own states
- Size
A microservice should be as big as your head
- Flexibility: allow solving any problems we might face: org, tech, scale, robustness...
- Architecture and organization alignment
Advantages
- Technology heterogeneity: using different techs inside each microservice -> pick the right tools for the job
- Robustness:
- Isolate the problematic service
- Scaling
- Ease of deployment:
- Minor changes should not require redeployment of whole system
- Organizational alignment
- Compose-ability
Pain points
- Dev experience
- Local developer resources
- Technology overhead
- Different technologies in each service
- New terms introduced
- Cost
- Resources: CPU /mem, licences,...
- Learning curves
- Not suitable for cost-reducing organizations
- Reporting
- Difficult to gather information
- Has to adopt new techniques
- Monitoring and troubleshooting
- Tracking the problematic service
- Security
- Information flows over network -> target of MitM attacks
- Testing
- Latency
- Data consistency
Not for
- Brand new products or startups
- Things are going to change quickly -> changes to service boundaries
- Organizations creating software that will be deployed and managed by their customers (don't have IT backgrounds)
Good for
- Big teams
- SaaS applications: 24/7 operations, rolling changes are required
- Orgs looking to provide services to customers over a variety of new channels
2 - Modeling Microservices
Good service boundaries
- Information hiding
- By keeping the number of assumptions small, it is easier to ensure that we can change one module without impacting others
The connections between modules are the assumptions which the modules make about each other.
- Cohesion
The code that changes together, stays together
- Coupling: services should be loosely coupled
- Loosely coupled service: it should know as little as it needs to about the services with which it collaborates
A structure is stable if cohesion is strong and coupling is low
- There is no absolute best way to organize our code. We have to strike the right balance between coupling and cohesion
Types of coupling
Domain coupling
- When one service needs functionality of another service
flowchart LR
Order -->|reserve| Warehouse
Order -->|take payment| Payment
- This should be kept to a minimum
- Information hiding: Share only what you absolutely have to, and send only the absolute minimum amount of data you need
Temporal coupling
- One microservice needs another microservice at the same time for the operation to be completed
- Difficult to scale -> use some form of async communications like message broker
flowchart LR
OrderProcessor -->|Sync blocking network call| Warehouse
Pass-through coupling
Passing data to another service purely because this data is needed by some other downstream microservices
flowchart LR
OrderProcessor -->|Shipping manifest| Warehouse
Warehouse -->|Shipping manifest| Shipping
- A change to data downstream can cause a significant upstream change
- Potential solutions
- Bypass the intermediary -> talk directly to the downstream service
- Trade-off: increases domain coupling (Order needs to know Shipping)
- Trade-off: increases logic complexity for Order, which was previously hidden by Warehouse (stock has to be reserved in Warehouse before shipping, and the Order service now manages this)
- Hide downstream details: the Order service passes shipping details as part of its contract, but the Warehouse takes care of building the shipping manifest
---
title: Bypassing intermediary
---
flowchart LR
Order -->|1 - reserve stock| Warehouse
Order -->|2 - shipping Manifest| Shipping
Order -->|3 - remove stock| Warehouse
---
title: Hiding details
---
flowchart LR
Order -->|order request| Warehouse
Warehouse -->|dispatch shipping Manifest| Shipping
Common coupling
- Making use of the same shared resources: database or filesystem
flowchart LR
Order -->|Set PLACED, PAID, COMPLETED status| OrderTable
Warehouse -->|Set PICKING, SHIPPED status| OrderTable
- Common coupling is sometimes OK, but often not
- Potential solutions:
- Must ensure that a single microservice manages the order state
- Or create little more than a thin wrapper microservice for CRUD operations on database
- A thin wrapper is a sign of weak cohesion and tighter coupling, since the logic to manage the data is spread elsewhere in the system instead of living in this service
Content coupling
- Upstream service reaches into the internals of a downstream service and changes its internal state. E.g: changing downstream service's DB
- Difference from common coupling: with Content Coupling the services are NOT aware that they share a common resource, while with Common Coupling they are
DDD
- Ubiquitous Language:
- Aggregate
- Bounded Contexts
3 - Migrating from Monoliths
Decomposition Patterns
Strangler Fig Patterns
- Wrapping an old system with the new system over time
flowchart TD
ExtService -->|request| Interceptor
Interceptor -->|calls not yet migrated| OldMonolith
Interceptor -->|calls migrated functions| NewMicroservices
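The interceptor in the diagram above can be sketched as a small routing function. This is only an illustration: the handler functions and the set of migrated path prefixes are hypothetical, and a real interceptor would usually be a reverse proxy rather than in-process code.

```python
# Hypothetical handlers standing in for the two backends.
def old_monolith(path: str) -> str:
    return f"monolith handled {path}"


def new_microservices(path: str) -> str:
    return f"microservice handled {path}"


# Path prefixes whose functionality has already been migrated (illustrative).
MIGRATED_PREFIXES = {"/orders", "/payments"}


def intercept(path: str) -> str:
    """Send migrated calls to the new services, everything else to the monolith."""
    for prefix in MIGRATED_PREFIXES:
        # Match the prefix itself or anything below it, but not "/ordersummary".
        if path == prefix or path.startswith(prefix + "/"):
            return new_microservices(path)
    return old_monolith(path)
```

As more functionality is migrated, prefixes are added to the set until the monolith receives no more calls.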
Parallel run
- Running the same request against the old and new system for comparison
Feature toggle (feature flags)
Data Decomposition Concerns
Performance
- Cannot join across tables any more; the data has to be fetched from other services
- Increase overall latency
Data integrity
Transactions
Reporting database
4 - Communication Styles
Synchronous blocking
Advantages
- Readable code
Disadvantages
- Temporal Coupling
- The response is lost if the upstream service instance dies
- Easily affect performance with downstream or network latency
Usage
- Simple architectures: length of the call chain is short
Asynchronous non-blocking
Advantages
- No Temporal Coupling
- Good for long processing
Disadvantages
- Complexity
- Range of choice
Common Data communication
- A microservice puts data into a defined location, and another microservice makes use of the data
Advantages
- Simple
Disadvantages
- Downstream services have to work in polling or periodically triggered jobs
- Can not be used for low-latency situations
- Introduce Common Coupling
Usage
- When working with an old system where technology supports are limited, or changes are expensive
- Sharing large volumes of data
Request - response
Synchronous implementation
- Upstream service opens network connection with downstream service
- The connection is kept open until the downstream service responds
- Downstream does NOT need to know about the caller; it just sends the response back
Asynchronous implementation
- Downstream service needs to know where to route the response
- The upstream service has to store the request states, so it knows how to handle the response
Notes
- All forms of request-response should implement a timeout mechanism
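A minimal sketch of a timeout around a request-response call, using `asyncio.wait_for` from the Python standard library. The `call_downstream` coroutine is a hypothetical stand-in for a real network call, and the delays are arbitrary.

```python
import asyncio


async def call_downstream(delay_s: float) -> str:
    """Hypothetical downstream call; the sleep simulates network latency."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_timeout(delay_s: float, timeout_s: float) -> str:
    try:
        # Give up waiting once timeout_s has elapsed.
        return await asyncio.wait_for(call_downstream(delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The caller decides what to do next: retry, fall back, or fail.
        return "timed out"


def run(delay_s: float, timeout_s: float) -> str:
    return asyncio.run(call_with_timeout(delay_s, timeout_s))
```

Without the timeout, a slow or dead downstream service would block the caller indefinitely.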
Usage
- When result of a request is needed before further actions take place
- When some sort of compensating action (e.g. a retry) should be carried out if the request didn't work
Event-Driven Communication
- A microservice emits an event that may or may not be received by other microservices
- Reduce coupling
flowchart LR
Warehouse -.-> Order_Status_Topic
Order_Status_Topic -.-> Notifications & Inventory
Implementation
- Message brokers
- Use HTTP. E.g: Atom
Event details
Just an ID events
- Contains only the ID of the entity
- E.g: notification service only needs to know the ID of new customer to send registration emails
- Cons:
- Adds Domain Coupling: the notification service needs more customer details (email) -> it will need to call the Customer service
- Emitting service may get lots of callback requests from downstream services
Fully detailed events
- Pros:
- Receiver doesn't need to know about other microservices
- Good for auditing or reconstituting an entity
- Cons:
- Size limitation for message brokers
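The trade-off between the two event styles can be sketched as follows. All class and function names here are hypothetical, and `FakeCustomerService` only stands in for a real call to the Customer service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerRegisteredIdOnly:
    customer_id: str


@dataclass(frozen=True)
class CustomerRegisteredDetailed:
    customer_id: str
    name: str
    email: str


class FakeCustomerService:
    """Stand-in for the Customer service; counts callbacks it receives."""

    def __init__(self, customers: dict):
        self._customers = customers
        self.calls = 0

    def get_email(self, customer_id: str) -> str:
        self.calls += 1
        return self._customers[customer_id]["email"]


def notify_from_id_only(event: CustomerRegisteredIdOnly, customer_service) -> str:
    # ID-only event: the receiver must call back to fetch the details.
    email = customer_service.get_email(event.customer_id)
    return f"welcome mail to {email}"


def notify_from_detailed(event: CustomerRegisteredDetailed) -> str:
    # Fully detailed event: everything needed travels with the event.
    return f"welcome mail to {event.email}"
```

The ID-only handler depends on the Customer service being reachable, while the detailed handler depends only on the event payload.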
Consideration
- Complexity: handle request-response async
- Ensure good monitoring and use correlation IDs
5 - Implementing Microservice Communication
Looking for ideal technology
Considerations
- Backward compatibility
- Make interface explicit
- Make service simple for consumers
- Hide internal implementation detail
RPC
- Makes a remote call between services look like a local call
- Usually requires an explicit schema, e.g: SOAP or gRPC
- Java RMI: doesn't need a separate service definition, but requires both client and server to use Java
- Normally works with a serialization protocol. E.g: protobuf (gRPC)
- Pros:
- Ease of client code generation
- Cons:
- Technology coupling: some RPC mechanism (like Java RMI) are tied to a specific platform
- Performance: still needs serialization, deserialization as well as connection time
- Needs special care, otherwise developers may use it the wrong way (e.g. treating remote calls as if they were local)
- Redundancy: unused fields can't be safely removed
- Do not abstract the RPC implementation too much.
REST
- Most important thing to know: resources
- The server creates different representations of a Resource on request
- There are different styles of REST. See Richardson Maturity Model
- REST doesn't talk about underlying protocols, although it is mostly used over HTTP
- HTTP has some defined capabilities to play well with REST
- HATEOAS (Hypermedia as the engine of application state)
- Challenges:
- Generate client code. OpenAPI to the rescue
- Performance: HTTP overhead, HATEOAS styled APIs...
- Good parts:
- Easily understood and integrated
- Scales well, with effective caching of requests
- Great for synchronous request-response
GraphQL
- Clients can define exactly which pieces of information to query, without making multiple requests
- Challenges
- Dynamic queries can cause performance issues which are hard to trace
- Client side caching or CDN caching is harder
- Not as good for writing data
- Good for:
- User GUI
- External services
- Alternatives: Backend for Frontend pattern
Message brokers
- Tend to provide queues or topics or both
- Queues: typically point to point
- Topics: multiple subscribers to one topic, and all of them receive the message. The sender is unaware of the receivers
- Topics are good for event-based, while queues are good for request-response
- Guaranteed delivery: The broker ensures the message is delivered
- If downstream service is unavailable, broker holds the message
- Less responsibility for upstream service
- Have to trust the broker
- Delivery order: not always guaranteed
- Be sure to handle duplicate messages
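One common way to handle duplicate deliveries is an idempotent consumer that remembers which message IDs it has already processed. The sketch below is an assumption-laden illustration: it presumes every message carries a unique ID, and it keeps the seen IDs in memory, whereas a real service would need durable storage.

```python
class IdempotentConsumer:
    """Processes each message ID at most once (in-memory sketch)."""

    def __init__(self) -> None:
        self._seen = set()      # IDs already handled; would be durable in practice
        self.processed = []     # payloads that were actually processed

    def handle(self, message_id: str, payload: str) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.processed.append(payload)
        return True
```

If the broker redelivers a message after a crash or a missed acknowledgement, the second delivery is detected and skipped.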
Serialization formats
Textual
- Readable
- JSON: de facto choice
- Better size than XML
- Avro: can send the schema as part of the payload
- JSON is the underlying format used to define the schema
Binary
- Choose when payload size matters
- Or when the time to read / write the payload needs to be reduced
- Protocol buffers: de facto choice
Avoid breaking changes
- Expansion changes
- Add new things to a microservice interface, don't remove old things
- Tolerant reader
- Consumers should not break if a payload contains more than they expect
- Right technology
- Pick technologies that make it easier to make backward compatible changes to the interface
- Explicit interface
- Be explicit about what a microservice exposes
- Catch accidental breaking changes early
- Contract testing
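The tolerant reader idea can be sketched with a consumer that picks out only the fields it needs from a JSON payload and ignores the rest. The `Customer` type and the field names are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Customer:
    id: str
    email: str


def read_customer(payload: str) -> Customer:
    """Read only the fields this consumer relies on; ignore anything extra."""
    data = json.loads(payload)
    return Customer(id=data["id"], email=data["email"])
```

Because unknown fields are ignored, the provider can make an expansion change (for example adding a `loyalty_tier` field) without breaking this consumer.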
Manage breaking changes
3 options
- Lockstep deployment
- Deploy the exposing microservices and consumers at the same time
- Coexist
- Run old and new versions of the microservice side by side
- Good for canary release
- Should only be done for a short time
- Emulate the old interface
- Expose new interface and emulate old ones
- Needs a version in the API path
Sharing code via libraries
- Sharing libraries can create coupling and force unnecessary redeployments
- Have to keep in mind that multiple versions coexist at the same time
Service Discovery
DNS
- Using domains to point to IP address
- DNS entries are cached (TTL), so clients can hold stale entries -> point the DNS entry at a load balancer instead of individual instances
flowchart TD
Client --> LoadBalancer
LoadBalancer --> Inventory_Instance1 & Inventory_Instance2
Dynamic Service Registries
- Service registers itself with central registry
API Gateways
- Focus on north-south traffic
- Mapping requests from ext parties to internal services
- Act as a reverse proxy
- May also implement features like API keys, logging, rate limiting, external customer portals
- Bad use cases:
- Call aggregation
- Protocol rewriting
- Intermediary for east-west calls (internal services call)
Service meshes
- Common functionalities are pushed into the mesh
- Including: timeout, mTLS, correlation IDs, service discovery, load balancing...
flowchart TD
Service_Mesh_CP -.-> Mesh_Proxy_1 & Mesh_Proxy_2
subgraph Local_Machine1
Order_Processor --> Mesh_Proxy_1
end
subgraph Local_Machine2
Mesh_Proxy_2 --> Payment
end
Mesh_Proxy_1 -->|Remote Network Call| Mesh_Proxy_2
6 - Workflow
Database Transactions
- ACID Transactions
Distributed Transactions - 2 Phase Commits (2PC)
- Voting phase: central coordinator confirms with all workers to see if changes can be made
- Commit phase
flowchart TD
Coordinator -->|1.Update row 1?| WorkerA
WorkerA -->|2. Yes| Coordinator
Coordinator -->|1. Delete row 2?| WorkerB
WorkerB -->|2. Yes| Coordinator
subgraph Customer
WorkerA --> CustomerDB
end
subgraph Enrollments
WorkerB --> EnrollmentsDB
end
- Usually implemented by locking
- The more participants, the higher the latency of the system and the more things can go wrong
- JUST SAY NO to Distributed Transactions
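The two phases can be sketched with in-memory workers. This is a simplified illustration under several assumptions: the `Worker` class is hypothetical, there is no real locking or networking, and coordinator or worker crashes between the phases (the hard part of 2PC) are not modeled.

```python
class Worker:
    """Hypothetical participant that can vote and then commit or roll back."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self) -> bool:
        # Voting phase: a real worker would take locks here before voting yes.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        self.state = "aborted"


def two_phase_commit(workers: list) -> bool:
    # Phase 1: ask every worker whether the change can be made.
    votes = [worker.prepare() for worker in workers]
    # Phase 2: commit only if all voted yes, otherwise roll everyone back.
    if all(votes):
        for worker in workers:
            worker.commit()
        return True
    for worker in workers:
        worker.rollback()
    return False
```

Every worker holds its prepared state until the coordinator decides, which is why latency and lock contention grow with the number of participants.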
Sagas
- Recover from business failure (Insufficient Funds), not technical failures (Internal Error)
- Failure modes:
- Backward recovery: define compensating actions to rollback committed transactions
- Forward recovery: pick up from the point of failure and keep processing
- Compensating transaction: an operation that undoes a committed transaction
- Does not work like a database rollback
- This is a semantic rollback, as we can't cleanly revert a transaction
- It is appropriate to persist rollback information in the system
- Reordering workflow steps to reduce rollbacks
- Orchestrated sagas: using a central coordinator (or orchestrator)
- Higher domain coupling
- The orchestrator can absorb logic that belongs in the downstream services
- Good when one team owns implementation of the entire saga
---
title: Orchestrated sagas
---
flowchart LR
Orchestrator -->|1. Reserve Stock| Warehouse
Orchestrator -->|2. Take payment| Payment
Orchestrator -->|3. Award points| Loyalty
Orchestrator -->|4. Send package| Warehouse
- Choreographed sagas
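An orchestrated saga with backward recovery can be sketched as a loop that runs each step and, on a business failure, runs the compensating actions of the already completed steps in reverse order. The step names mirror the diagram above; `BusinessFailure`, `run_saga` and `demo` are hypothetical names, and the actions are no-ops standing in for calls to real services.

```python
from typing import Callable, List, Tuple


class BusinessFailure(Exception):
    """A business-level failure such as insufficient funds."""


# (step name, action, compensating action); all names are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step], log: List[str]) -> bool:
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except BusinessFailure as exc:
            log.append(f"{name} failed: {exc}")
            # Backward recovery: undo committed steps in reverse order.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated {done_name}")
            return False
        log.append(f"{name} done")
        completed.append((name, compensate))
    return True


def demo() -> Tuple[bool, List[str]]:
    log: List[str] = []

    def take_payment() -> None:
        raise BusinessFailure("insufficient funds")

    steps: List[Step] = [
        ("reserve stock", lambda: None, lambda: None),
        ("take payment", take_payment, lambda: None),
        ("award points", lambda: None, lambda: None),
    ]
    return run_saga(steps, log), log
```

In this sketch the compensation for "reserve stock" would release the reservation; it is a semantic rollback, so the log still records that the reservation once happened.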
7 - Build
Currently omitted for more important parts
CI
Repository model
8 - Deployment
Currently omitted for more important parts
9 - Testing
Currently omitted for more important parts
10 - Monitoring to Observability
Building blocks for Observability
Log aggregation:
- Prerequisite for building a microservice
- Collect logs from each machine and make them available centrally
- Processes log to the local filesystem
- A local daemon process periodically collects and forwards these logs to some sort of store
flowchart LR
subgraph Host
MicroserviceInstance -->|Log to local filesystem| Logs
Logs --> Log-forwarding_Daemon
end
Log-forwarding_Daemon -->|Periodically| Log_Aggregation_Tool
Operators --> Log_Aggregation_Tool
Common Format
- Date, time, microservice name, log level... in CONSISTENT places in each log
- Log formatting should be done by the microservice, not by the log forwarder (to avoid performance issues)
- Must have correlation IDs
- Timing is not guaranteed to be in the correct order, due to clock differences between microservices
- Use a logical clock or distributed tracing to address the issue
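A common log format with a correlation ID can be sketched with Python's standard `logging` module. The field order and the `correlation_id` attribute name are assumptions made for this illustration; with this formatter, every log call has to pass the ID through `extra`.

```python
import io
import logging

# Date/time, service name, level and correlation ID always in the same place.
LOG_FORMAT = "%(asctime)s %(name)s %(levelname)s [%(correlation_id)s] %(message)s"


def make_logger(service_name: str, stream) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


def demo_line() -> str:
    """Log one line for a hypothetical order-service and return it."""
    stream = io.StringIO()
    logger = make_logger("order-service", stream)
    logger.info("order placed", extra={"correlation_id": "abc-123"})
    return stream.getvalue()
```

Searching the aggregated logs for one correlation ID then returns the lines from every microservice that took part in the same request.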
Concerns:
- Use Elasticsearch cautiously when you need to make sure your logs are not lost
- There should be a re-index mechanism
- Sizes of logs, performance, scalability
- Pick the right log aggregator
- Restrict access to logs
- Do not log certain types of information
Metrics aggregation
- Collect system metrics for prediction on system health, scaling, capacity planning...
- These metrics could be stored and reported at different resolutions (e.g: sampling rates...)
Cardinality
- The number of fields that can be easily queried in a given data point
- The more potential fields we want to query of our data, the higher the cardinality we need to support
- Bear in mind whether the system can support high-cardinality data
Tools
- Prometheus, Graphite
- Honeycomb, Lightstep
Distributed Tracing
Implementation
- Capture span information (supported by standard APIs like OpenTracing or OpenTelemetry)
- Send span information to collector
- Either send directly from service or using a forwarding agent.
- Using an agent allows more advanced capabilities like changing sampling, adding tags, or buffering information
- Have a collector receive this information
Tools
- Jaeger (Open source)
- Honeycomb, Lightstep
- Whatever you pick should support the OpenTelemetry API
Are we doing OK?
- How well are we or the systems doing
- SLA (Service Level Agreement)
- An agreement between people building and using the system
- Describe user expectation and what happens if this level of behavior is not reached
- SLOs (Service Level Objectives)
- Define what the team signs up to provide
- Also describes goals not described in SLAs
- SLI (Service Level Indicators)
- Measures of something our software does. E.g: response time of a process...
- Error Budgets
- How much error is acceptable in a system
- Encouraging trying of new things or changes
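The arithmetic behind an error budget can be sketched as follows, assuming the SLO is expressed as an availability fraction over a fixed period. The function names and the 30-day default are assumptions for illustration.

```python
def error_budget_minutes(slo_availability: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed by an availability SLO over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - slo_availability)


def budget_remaining(slo_availability: float, downtime_minutes: float,
                     period_days: int = 30) -> float:
    """Budget left after the downtime already incurred (negative if overspent)."""
    return error_budget_minutes(slo_availability, period_days) - downtime_minutes
```

For example, a 99.9% availability SLO over 30 days allows roughly 43.2 minutes of downtime; while budget remains, the team has room to try riskier changes.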
Alerting
Testing in production
- Synthetic Transaction: injecting fake user behavior into the production system
- Using end-to-end tests of service or even whole system
- Take care of user data or side effects
- A/B testing: deploying 2 different versions of the same functionality
- Deciding on how something should be done
- Canary release: enable the new functionalities for a small number of users
- Revert the changes when needed
- Parallel run: execute 2 different implementations of the same function side by side
- Results can be compared
- Smoke tests: simple tests or synthetic transactions run after the system is deployed, but before it is released
- Chaos engineering: injecting faults into the production system to see if it can handle these expected issues
11 - Security
12 - Resiliency
Concepts
- Robustness: ability to absorb expected perturbation
- Things like network/hard drive failure, service unavailable...
- Requires prior knowledge of failures
- Can introduce a new layer of complexity and potential sources of new issues
- Rebound: ability to recover after a traumatic event
- Requires preparing for these situations in advance
- Graceful extensibility: how well we deal with a situation that is unexpected
- Sustained adaptability: ability to continually adapt to changing environments, stakeholders and demands
Failures are likely to happen, so we must prepare in advance: not only to prevent them, but also to handle them when they occur
Defining limits
- Response time/latency: how long operations should take for a given number of concurrent users
- Availability: acceptable downtime
- Durability of data: how much data loss is acceptable, how long should data be kept for
- Should be defined together with SLA/SLOs