CloudServices
- https://www.youtube.com/watch?v=2yko4TbC8cI
- ZeroToProTraining.com - Hasan Mir. "What is NoSQL Database?" YouTube; HandsonERP.
- Fow12: GOTO 2012 - Introduction to NoSQL - Martin Fowler. https://www.youtube.com/watch?v=qI_g07C_Q5I
- NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Martin Fowler, ThoughtWorks.
- Kol19: Microservices with Rust, Denis Kolodin
- ACID: Atomic, Consistent, Isolated, Durable (RDBMS)
- ADC: Application Delivery Controller. https://www.oreilly.com/learning/developer-defined-application-delivery?imm_mid=0ee8c5&cmp=em-webops-na-na-newsltr_20170310
- BASE: Basically Available, Soft state, Eventually consistent (a contrived acronym (Fow))
- CAP Theorem: Consistency, Availability, PartitionTolerance - Pick any 2(Fow)
- Partition: do you want consistency or availability? (It can be a spectrum/sliding choice.)
- In practice it is more a trade-off of consistency vs. response time.
- SOA: Service Oriented Architecture
- Microservice: a small, independently deployable service with a single responsibility.
- RDBMS: Tables
- OLAP: Cubes
- NoSQL: Collections (the term "NoSQL" was originally a Twitter hashtag for a single meetup).
- Impedance mismatch problem: cohesive data (as seen on screen) gets splattered across multiple tables.
- Kubernetes
- NGINX
Microservices
- https://www.youtube.com/watch?v=gEeHZwjwehs - From monolith to microservices
- https://www.youtube.com/watch?v=kb-m2fasdDY - Scaling Uber
- https://www.youtube.com/watch?v=pTf5mqOrwvY - Building high-performance teams
- Real World Microservices https://www.youtube.com/watch?v=1aaw7iYS_VM&t=301s
- A Microservices Reference Architecture - https://www.youtube.com/watch?v=KHqMPRA6jVI
- The State of the Art in Microservices by Adrian Cockcroft - https://www.youtube.com/watch?v=pwpxq9-uw_0
Docker
- Faster, Cheaper and Safer: Secure Microservice Architectures using Docker - https://www.youtube.com/watch?v=zDuTIZBh5_Q
Kafka
- kafka event bus
- What is Apache Kafka - https://www.youtube.com/watch?v=mAgmwHHR6xY
- Developing Real-Time Data Pipelines with Apache Kafka - https://www.youtube.com/watch?v=GRPLRONVDWY
- I recommend that you create and maintain services that determine your competitive advantage yourself and then use third-party services for other tasks(Kol19?).
The Twelve-Factor App approach is a methodology for building Software as a Service (SaaS) applications to fulfill the following three objectives (Kol19, p59):
- Configurations in declarative formats
- Maximum portability with operating systems and clouds
- Continuous deployment and scaling
NoSQL
- Scalability
- Performance
- High availability
- Less functionality than an RDBMS, more performance
- Encapsulate the DB
Characteristics(Fow)
- non-relational
- open-source
- cluster-friendly: able to run on big clusters.
- 21st Century Web
- Schema-less
Data model / type of NoSQL storage (see the sketch after this list):
- Key-value store
  - Memcached
  - Coherence
  - Redis
  - Project Voldemort
  - Riak
- Tabular
  - BigTable
  - HBase
  - Accumulo
  - Dynamo?
- Document-oriented (complex data) - usually JSON, but could be XML
  - MongoDB
  - CouchDB
  - Cloudant
  - RavenDB
- Column-family
  - Cassandra
  - Apache HBase
- Graph (tend to be ACID)
  - Neo4j
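To make the storage models above concrete, here is a hedged sketch (plain Python dicts, no real database drivers; the user record and field names are invented for illustration) of how the same data might be shaped for a key-value store, a document store, a column-family store, and a graph store. The single self-contained document is also what sidesteps the impedance mismatch mentioned earlier.

```python
# Sketch only: the same "user" record expressed in the shape each NoSQL
# family expects. Plain Python dicts stand in for the actual databases.
import json

# Key-value store: an opaque value looked up by a single key (Redis/Memcached style).
key_value = {
    "user:42": '{"name": "Ada", "city": "London", "orders": [1001, 1002]}',
}

# Document store: one self-contained JSON-like document per aggregate
# (MongoDB/CouchDB style).
document = {
    "_id": 42,
    "name": "Ada",
    "city": "London",
    "orders": [
        {"id": 1001, "total": 12.50},
        {"id": 1002, "total": 7.99},
    ],
}

# Column-family / tabular: a row key mapping to a sparse set of named columns
# (Cassandra/HBase style).
column_family = {
    "row:42": {"name": "Ada", "city": "London",
               "order:1001": "12.50", "order:1002": "7.99"},
}

# Graph: explicit nodes and relationships (Neo4j style).
graph = {
    "nodes": {42: {"label": "User", "name": "Ada"}, 1001: {"label": "Order"}},
    "edges": [(42, "PLACED", 1001)],
}

if __name__ == "__main__":
    print(json.loads(key_value["user:42"])["name"])   # Ada
    print(document["orders"][0]["total"])             # 12.5
```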
What is Missing:
- No 'Joins' support
- No complex transaction support
- No constraints support
What is available:
- Query Language(Other than SQL)
- Fast performance
- Horizontal Scalability
When to use:
- The ability to store and retrieve great quantities of data is important.
- Storing relationships between the elements is not important
- Dealing with growing lists of elements: twitter posts, internet server logs, blogs
- The data is not structured or the structure is changing with time.
- Prototypes or fast applications need to be developed.
- Constraint and validation logic is not required to be implemented in the database.
When not to use:
- Complex transactions need to be handled
- Joins or validations must be handled by the database.
Fowler's quick select:
- User sessions: Redis
- Financial data: RDBMS
- Shopping cart: Riak
- Recommendations: Neo4j
- Product catalog: MongoDB
- Reporting: RDBMS
- Analytics: Cassandra
- User activity logs: Cassandra
Keep your transactions within a single aggregate (Fow).
Use delta/diff from each update.
Notice that multiple applications can add columns to the same row without conflicting - they only conflict if they try to update a column with the same name. Hence, we can resolve the conflict by "moving" the column name in the Fields table to the row key, creating a compound key: <row key>/<field name>. (For this example, we're using a slash '/' to separate the two parts of the key.) This will yield one row per field name/field value pair.
It also doesn't matter which order the column updates occur since they will all be added to the same row. The only rule we have to follow is that each log record is processed by a single update application.
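A hedged sketch of the compound-key idea described above (a plain Python dict stands in for the wide-column store; the record key, field names, and values are invented): each field update is written under its own `<record key>/<field name>` key, so concurrent update applications never touch the same row.

```python
# Sketch: resolving column-update conflicts by moving the field name into the
# row key, giving one row per field-name/field-value pair.

store = {}

def apply_log_record(record_key, field_name, field_value):
    """Write one field update under a compound key '<record key>/<field name>'."""
    compound_key = f"{record_key}/{field_name}"
    store[compound_key] = field_value  # the last write for this field wins

# Two update applications process different fields of the same logical record;
# they never collide because their compound keys differ, and the order of the
# updates does not matter.
apply_log_record("order-1001", "status", "shipped")
apply_log_record("order-1001", "carrier", "DHL")

print(store)
# {'order-1001/status': 'shipped', 'order-1001/carrier': 'DHL'}
```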
- http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
- https://highlyscalable.wordpress.com/2012/09/18/distributed-algorithms-in-nosql-databases/
- https://pettermahlen.com/2011/10/18/do-nosql-databases-make-consistency-too-hard/
- https://www.xaprb.com/blog/2014/12/08/eventual-consistency-simpler-than-mvcc/
When updating a page (see the sketch after this list):
- Send the full structure back, with information on what changed in which fields.
- Apply the updates.
- Refresh the whole structure.
- Send the structure back to the page.
- Kafka
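A hedged sketch of the update flow above (the page layout and the `apply_updates` helper are invented for illustration, and no event bus is involved here): the client sends only the changed fields as a delta, the server applies them, and the whole refreshed structure is sent back to the page.

```python
# Sketch: apply a field-level delta to a stored page structure and return the
# refreshed structure. The page fields are hypothetical sample data.

page = {"title": "Cloud services", "body": "old text", "tags": ["nosql"]}

def apply_updates(structure, delta):
    """Apply only the fields listed in the delta, then return the whole structure."""
    for field, new_value in delta.items():
        structure[field] = new_value
    return dict(structure)  # "refresh": hand a full copy back to the page

# The client sends back information about what changed in which fields:
delta_from_client = {"body": "new text", "tags": ["nosql", "kafka"]}

refreshed = apply_updates(page, delta_from_client)
print(refreshed["body"])   # new text
print(refreshed["title"])  # untouched field: Cloud services
```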
To watch
- From: https://www.youtube.com/watch?v=yPf5MfOZPY0 Microservices:
- Very, very small
- Team size of one to develop/maintain
- Loosely coupled (including flow)
- Multiple versions acceptable (encouraged?)
- Self-monitoring of each service
- Publish interesting "stuff" (w/o explicit requirements)
- "Application" seems to be a poor conceptualization.
- Microservices view:
  - Event bus replaces the operational database.
  - DB per microservice (if persistence is needed)
  - Polyglot persistence (various NoSQL, SQL)
  - Few (10%) writable; even fewer transactional.
- Rapids: every event
- Rivers: themed events (see the event-bus sketch after this list)
- Ponds: state/history
Need a high-performance event bus (like Kafka): 250k msgs/sec (both reads and writes count as a msg), e.g. 0MQ.
- Always publish to the river.
- Always listen to the river.
- Event publishing
- Solution collecting: Redis
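A hedged, in-memory sketch of the rapids/rivers idea from the talk (this is not Kafka; the bus class, river names, and event shapes are all invented): every event lands on the rapids, while services always publish to and listen on themed rivers.

```python
# Sketch: a toy event bus with a "rapids" stream carrying every event and
# "rivers" carrying themed subsets. In production the rivers would be Kafka
# topics; here plain lists and callbacks stand in.
from collections import defaultdict

class ToyEventBus:
    def __init__(self):
        self.rapids = []                        # every event ever published (also the "pond" history)
        self.river_listeners = defaultdict(list)

    def listen(self, river, callback):
        """A service always listens on a themed river."""
        self.river_listeners[river].append(callback)

    def publish(self, river, event):
        """A service always publishes to a river; the rapids see everything."""
        self.rapids.append((river, event))
        for callback in self.river_listeners[river]:
            callback(event)

bus = ToyEventBus()
bus.listen("orders", lambda e: print("shipping service saw:", e))
bus.publish("orders", {"type": "order-placed", "id": 1001})
bus.publish("users", {"type": "user-signed-up", "id": 42})
print(len(bus.rapids))  # 2 - the rapids carry every event
```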
Chad Fowler vs. Fred George
- Chad: use synchronous as the default
  - Algorithms are typically described serially
  - Programmer understanding -> productivity
- Fred: use asynchronous as the default
  - Robustness should be the primary goal
  - Supports better de-coupling (which gives easier testing)
  - Teach the programmers!
30:37 Post need, select Solution.
34:58 - Service Taxonomy
Microservices are like OO:
- Conceptualization (job)
  - Every service has one job
  - If it has two jobs, make two services
- Communication
  - Minimize messages (whether RESTful or events)
- Encapsulation
  - A service has its own persistence
  - If persistence is shared, it is really just one logical service
Clojure Loves Shared data
PIGATO - a high-performance microservices framework based on ZeroMQ
Taxonomy may be useful (before it's too late):
- Synchronicity degree
- Primary API access to services
- Ratio of number of services to average size
- Zones for clarity
- DB / service ratio
- Expose potential DB hindrances to rapid deployment
- Values and principles / complexities
- Autonomy / Communication
  - The people responsible for the external interfaces need to be very good at communication.
- Speed of change / Execution
  - You need to pay for automation of testing and deployment to keep up with the flux of releases.
- Scale / Resilience
  - Providing consistency.
- Composability / Maintenance
- Tech diversity / Operational overhead
15min Possible boundary indicators:
- Domain bounded context (e.g. a language change)
- Rate of change
- Team structure (Conway's law)
- What hurts most
  - E.g. put a service in front of the pain-point legacy services.
20min (Neo4j)
- Evolutionary
  - One service at a time (not 20 in one big bang).
- Back to basic principles
  - Single responsibility
  - Loose coupling
  - High cohesion
  - SOLID principles
- https://www.youtube.com/watch?v=57UK46qfBLY - GOTO 2016 • Microservices at Netflix Scale: Principles, Tradeoffs & Lessons Learned • R. Meshenberg
- It took 7 years to go completely into the Cloud.
- Buy vs. build
  - Use or contribute to OSS technologies first.
  - Only build what you have to.
- Services should be stateless*
  - Must not rely on sticky sessions
  - Prove it by Chaos testing
  - (*Except the persistence/caching layers)
- Scale out vs. scale up
  - If you keep scaling up, you'll hit a limit.
  - Horizontal scaling gives you a longer runway.
- Redundancy and isolation for resiliency
  - Make more than one of anything
  - Isolate the blast radius of any given failure.
- Automate destructive testing
  - Simian Army
  - Started with Chaos Monkey
Time: 11:00
- Register into service discovery
- Implement externally callable host check
- Be able to verify it is operational.
- Be able to get information on how to connect to other services.
- Verify statelessness: Chaos Monkey
- Randomly kill services
Time: 13:34 Data
- Data - from RDBMS to Cassandra
- Multi-regional replication
- Billing - 15:40
16:35
- Our priorities:
  - Innovation
    - Tight coupling doesn't work
    - Loose coupling: each team works independently, with end-to-end ownership
      - Develop, Test, Deploy, Support
      - Architect -> Design -> Develop -> Review -> Test -> Deploy -> Run -> Support -> Architect...
    - Separation of concerns: 19:23
      - UI: Feature A > Feature B > Feature C
      - Personalization: Feature D > A/B test E
      - Mid-tier: A/B test F > Feature H
      - Infrastructure: Availability > Scalability > Security
  - Reliability
  - Efficiency
20:00
- Microservices is an org change
  - Evolving the organization
  - Central infrastructure investment
- Migration doesn't happen overnight (Roman riding: two horses, one rider)
  - Living in the hybrid world
  - Supporting 2 tech stacks
  - Double the maintenance
  - Multi-master data replication
23:16
- IPC is crucial for loose coupling
- Common language between the services
- Establishes the contract of interaction
- Caching to protect DBs: 24:56 (the most heavily hit DBs); see the sketch below
  1. Read from cache
  2. On a cache miss, call the service
  3. The service calls the DB and responds
  4. The service updates the cache
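A hedged sketch of the cache-to-protect-the-DB flow in the four steps above (the cache, service, and database are simulated with plain dicts and functions; the keys and values are invented):

```python
# Sketch of the cache-aside flow: read from cache, on a miss call the service,
# the service reads the DB and responds, then the cache is updated.

cache = {}
database = {"movie:1": {"title": "Example Movie", "rating": 4.5}}

def service_lookup(key):
    """Step 3: the service calls the DB and responds."""
    return database.get(key)

def get(key):
    if key in cache:                 # Step 1: read from cache
        return cache[key]
    value = service_lookup(key)      # Step 2: on a cache miss, call the service
    if value is not None:
        cache[key] = value           # Step 4: the service updates the cache
    return value

print(get("movie:1"))  # miss -> DB -> cached
print(get("movie:1"))  # hit -> served from cache, the DB is protected
```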
- If you can't see it, you can't improve it.
- Will your telemetry scale?
  - Observe -> Orient -> Decide -> Act -> Observe...
  - 20 million metrics per second at Netflix
  - Some of this data is fed into automated error-correction tools
- You don't have the luxury of an architectural diagram, because things change all the time; you must be able to discern at run time:
  - who calls whom
  - how
  - where the errors are
  - where traffic is flowing
  - whether there is congestion in the system
- Cascading failures affect the whole uptime
- Circuit breaker (a minimal sketch follows below)
  - Detect the problem
  - Is it fatal?
  - If not fatal, go to fallbacks:
    - Hystrix
  - FIT: Failure Injection Testing framework
    - latency
    - others
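A hedged, minimal circuit-breaker sketch of the idea above (this is not Hystrix; the failure threshold and the flaky dependency are invented): after repeated failures the breaker opens and calls go straight to the fallback, protecting callers from a failing dependency.

```python
# Sketch: a tiny circuit breaker with a fallback. After `max_failures`
# consecutive errors the breaker opens and the fallback is returned directly.
# Hystrix implements a far richer version; the numbers here are arbitrary.
class CircuitBreaker:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failure_count = 0

    def call(self, func, fallback):
        if self.failure_count >= self.max_failures:
            return fallback()            # breaker is open: fall back immediately
        try:
            result = func()
            self.failure_count = 0       # a success closes the breaker again
            return result
        except Exception:
            self.failure_count += 1      # detect the (non-fatal) problem
            return fallback()

def flaky_dependency():
    raise TimeoutError("simulated latency failure")

breaker = CircuitBreaker(max_failures=2)
for _ in range(4):
    print(breaker.call(flaky_dependency, fallback=lambda: "cached default"))
```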
- Monthly, randomly select one region and fail it. As a user, you will not see a thing (Chaos Kong).
- Containers change the level of encapsulation from VM to process
- Containers can help deliver a great developer experience
- To run containers in production at scale... 37:35
  - Titus UI/API
  - Fenzo
  - Cassandra
  - ZooKeeper
  - Docker
  - Mesos
- netflix.github.com
- Microservices at scale require organizational change and centralized infrastructure investment.
- Be aware of your situation and what works for you. 41:33
- Zuul: front-end proxy
- Deployment: Asgard, Spinnaker
- Visualization: Flux and Flow
Probably ties into ADC.
The goal is that failures and errors will be detected, mitigated, and resolved before they bring down any part of the microservice ecosystem.
Monitoring a production-ready microservice has four components (https://www.oreilly.com/learning/monitoring-a-production-ready-microservice?imm_mid=0ee8c5&cmp=em-webops-na-na-newsltr_20170310)
- The first is proper logging of all relevant and important information, which allows developers to understand the state of the microservice at any time in the present or in the past.
- The second is the use of well-designed dashboards that accurately reflect the health of the microservice, and are organized in such a way that anyone at the company could view the dashboard and understand the health and status of the microservice without difficulty.
- The third is actionable and effective alerting on all key metrics, a practice that makes it easy for developers to mitigate and resolve problems with the microservice before they cause outages.
- The final component is the implementation and practice of running a sustainable on-call rotation responsible for the monitoring of the microservice.
- The behavior of a microservice is the sum of its behavior across all of its instantiations.
- Identify which properties of a microservice are necessary and sufficient for describing its behavior, and then determine what changes in those properties tell us about the overall status and health of the microservice.
- Host and infrastructure metrics are those that pertain to the status of the infrastructure and the servers on which the microservice is running, while microservice metrics are metrics that are unique to the individual microservice.
Key host and infrastructure metrics include:
- the CPU utilized by the microservice on each host
- the RAM utilized by the microservice on each host
- the available threads
- the microservice's open file descriptors (FD)
- the number of database connections that the microservice has to any databases it uses
Developers should be able to know how much CPU their microservice is using on one particular host and how much CPU their microservice is using across all hosts it runs on (see the sketch below).
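A hedged sketch of the per-host vs. fleet-wide view just described (the host names and CPU percentages are invented sample data): the same metric should be inspectable for one particular host and aggregated across every host the service runs on.

```python
# Sketch: aggregate a per-host CPU metric for one microservice so developers
# can see usage on a single host and across every host it runs on.

cpu_percent_by_host = {
    "host-a": 42.0,
    "host-b": 63.5,
    "host-c": 12.0,
}

def cpu_on_host(host):
    """CPU used by the microservice on one particular host."""
    return cpu_percent_by_host[host]

def cpu_across_all_hosts():
    """The same metric, viewed across all hosts the microservice runs on."""
    values = cpu_percent_by_host.values()
    return {"total": sum(values), "average": sum(values) / len(values)}

print(cpu_on_host("host-b"))    # 63.5 - one particular host
print(cpu_across_all_hosts())   # fleet-wide view of the same metric
```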
Microservice key metrics:
- Language-specific metrics
- Availability
- SLA
- Latency
- Endpoint success
- Endpoint responses
- Endpoint response times
- Clients
- Errors and exceptions
- Dependencies
we also must monitor the availability of the service, the service-level agreement (SLA) of the service, latency (of both the service as a whole and its API endpoints), success of API endpoints, responses and average response times of API endpoints, the services (clients) from which API requests originate (along with which endpoints they send requests to), errors and exceptions (both handled and unhandled), and the health and status of dependencies.
Importantly, all key metrics should be monitored everywhere that the application is deployed. This means that every stage of the deployment pipeline should be monitored. Staging must be closely monitored in order to catch any problems before a new candidate for production (a new build) is deployed to servers running production traffic.
- Logging needs to be such that developers can determine from the logs exactly what went wrong and where things fell apart.
- tracking and logging requests and responses throughout the entire client and dependency chains from end-to-end can illuminate important information about the system that would otherwise go unknown (such as total latency and availability of the stack).
- Logging is expensive: logs are expensive to store and expensive to access, and both storing and accessing them comes with the additional cost of making calls over the network.
- Avoid adding debugging logs in code that will be deployed to production—such logs are very costly.
- If any logs are added specifically for the purpose of debugging, developers should take great care to ensure that any branch or build containing these additional logs does not ever touch production.
- Logging needs to be scalable, it needs to be available, and it needs to be easily accessible and searchable.
- It's often necessary to impose:
  - per-service logging quotas
  - limits and standards on what information can be logged
  - limits on how many logs each microservice can store
  - limits on how long the logs will be stored before being deleted
- Every microservice must have at least one dashboard where all key metrics (such as hardware utilization, database connections, availability, latency, responses, and the status of API endpoints) are collected and displayed.
- A dashboard is a graphical display that is updated in real time to reflect all the most important information about a microservice.
- Dashboards should be easy to interpret so that an outsider can quickly determine the health of the microservice: anyone should be able to look at the dashboard and know immediately whether or not the microservice is working correctly.
- A dashboard should also serve as an accurate reflection of the overall quality of monitoring of the entire microservice.
  - Any key metric that is alerted on should be included in the dashboard.
  - The exclusion of any key metric from the dashboard will reflect poor monitoring of the service, while the inclusion of metrics that are not necessary will reflect a neglect of alerting (and, consequently, monitoring) best practices.
- In addition to key metrics, information about each phase of the deployment pipeline should be displayed, though not necessarily within the same dashboard.
  - Developers working on microservices that require monitoring a large number of key metrics may opt to set up separate dashboards for each deployment phase (one for staging, one for canary, and one for production) to accurately reflect the health of the microservice at each deployment phase.
- Developers should never need to watch a microservice's dashboard in order to detect incidents and outages.
- To assist in determining problems introduced by new deployments, it helps to include information about when a deployment occurred in the dashboard.
  - The most effective and useful way to accomplish this is to make sure that deployment times are shown within the graphs of each key metric.
- Well-designed dashboards also give developers an easy, visual way to detect anomalies and determine alerting thresholds.
  - Very slight or gradual changes or disturbances in key metrics run the risk of not being caught by alerting, but a careful look at an accurate dashboard can illuminate anomalies that would otherwise go undetected.
- The detection of failures, as well as the detection of changes within key metrics that could lead to a failure, is accomplished through alerting.
- Effective and actionable alerting is essential to preserving the availability of a microservice and preventing downtime.
- Alerts must be set up for all key metrics.
  - Any change in a key metric at the host level, infrastructure level, or microservice level that could lead to an outage, cause a spike in latency, or somehow harm the availability of the microservice should trigger an alert.
  - Importantly, alerts should also be triggered whenever a key metric is not seen.
- Three types of thresholds should be set for each key metric, each with both upper and lower bounds (see the sketch after this list):
  - Normal: reflects the usual, appropriate upper and lower bounds of each key metric and shouldn't ever trigger an alert.
  - Warning: triggers alerts when there is a deviation from the norm that could lead to a problem with the microservice; warning thresholds should be set such that they will trigger alerts before any deviations from the norm cause an outage or otherwise negatively affect the microservice.
  - Critical: should be set based on which upper and lower bounds on key metrics actually cause an outage, cause latency to spike, or otherwise hurt a microservice's availability.
- In an ideal world, warning thresholds should trigger alerts that lead to quick detection, mitigation, and resolution before any critical thresholds are reached.
- In each category, thresholds should be high enough to avoid noise, but low enough to catch any and all real problems with key metrics.
- To determine the appropriate thresholds for a new microservice (or even an old one), developers can run load testing on the microservice to gauge where the thresholds should lie.
  - Running "normal" traffic loads through the microservice can determine the normal thresholds, while running larger-than-expected traffic loads can help determine warning and critical thresholds.
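A hedged sketch of the three threshold bands described above (the metric name and all bounds are invented examples; a real system would derive them from load testing): each band has an upper and a lower bound, and a reading is classified as normal, warning, or critical.

```python
# Sketch: classify a key-metric reading against normal/warning/critical bands,
# each with upper and lower bounds. The latency numbers are arbitrary examples.

THRESHOLDS = {
    # (lower bound, upper bound) for request latency in milliseconds
    "normal":   (5, 200),
    "warning":  (2, 500),
    "critical": (1, 1000),
}

def classify(latency_ms):
    """Return the alert level for a latency reading (or 'outage' beyond critical)."""
    for level in ("normal", "warning", "critical"):
        low, high = THRESHOLDS[level]
        if low <= latency_ms <= high:
            return level
    return "outage"

for reading in (50, 350, 800, 2000):
    print(reading, "->", classify(reading))
# 50 -> normal, 350 -> warning, 800 -> critical, 2000 -> outage
```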
- The first step is to create step-by-step instructions for each known alert that detail how to triage, mitigate, and resolve it.
- Runbooks are crucial to the monitoring of a microservice: they allow any on-call developer to have step-by-step instructions on how to mitigate and resolve the root causes of each alert.
- Any alert that, once triggered, requires a simple set of steps to be taken in order to be mitigated and resolved can easily be automated away.
- Once this level of production-ready monitoring has been established, a microservice should never experience the same exact problem twice.
To prevent burnout, on-call rotations should be both brief and shared: no fewer than two developers should ever be on call at one time, and on-call shifts should last no longer than one week and be spaced no more frequently than one month apart.
- Be careful of the depth of the dependency tree
  - Imagine that you have a microservice that has to wait for the response of another microservice before it can send a response to a client. The other microservice, in turn, also has to wait for another microservice, and so on (Kol18, p115).
- Loose coupling means that a microservice doesn't know anything about other microservices, or how many there are (Kol18, p115).
- Message-driven: you use messages as the unit of interaction (Kol18, p116).
  - To have totally uncoupled microservices, you should use a message queue or a message broker service (Kol18, p116).
  - I guess the downside is a delay in responses.
- Claim: if your microservices have to process hundreds of thousands of messages, you should use asynchronous code (Kol18, p117); see the sketch after this list.
  - I assume that this claim requires you to be able to throw a lot of resources at the microservice.
  - I also assume that, for the same CPU power, the async application might be slower to respond than the single-threaded app.
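A hedged sketch of the asynchronous, message-driven style the claim refers to (plain asyncio with an in-memory queue standing in for a broker; the handler and message shapes are invented, not Kolodin's code): many messages are processed concurrently, so one slow message does not block the rest.

```python
# Sketch: message-driven, asynchronous processing with asyncio. An in-memory
# queue stands in for a message broker; the handler and messages are made up.
import asyncio

async def handle_message(msg):
    # Simulate I/O-bound work (e.g. calling another service); while one message
    # waits, the event loop processes others instead of blocking a thread.
    await asyncio.sleep(0.01)
    return f"processed {msg}"

async def worker(queue, results):
    while True:
        msg = await queue.get()
        results.append(await handle_message(msg))
        queue.task_done()

async def main():
    queue = asyncio.Queue()
    results = []
    # A few concurrent workers drain the queue; producers never know about them.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(10)]
    for i in range(100):
        queue.put_nowait({"id": i})
    await queue.join()            # wait until every message has been processed
    for w in workers:
        w.cancel()
    print(len(results), "messages handled")

asyncio.run(main())
```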
- Connecting microservices:
  - Message broker
  - Remote procedure calls (RPC) (Kol18, p118); a minimal JSON-RPC sketch follows below
    - JSON-RPC
    - gRPC/protobuf
    - Thrift
    - XML-RPC
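Of the RPC options above, JSON-RPC is the simplest to show in a few lines. A hedged sketch (no network transport or real RPC library; the "add" method and its dispatch table are purely illustrative) of what a JSON-RPC 2.0 request/response pair looks like:

```python
# Sketch: building and answering a JSON-RPC 2.0 call without any transport.
# In a real microservice the request would travel over HTTP or a socket.
import json

def make_request(method, params, request_id=1):
    """Build the JSON-RPC 2.0 request a client would send."""
    return json.dumps({"jsonrpc": "2.0", "method": method,
                       "params": params, "id": request_id})

def handle_request(raw, methods):
    """Dispatch the named method and build the JSON-RPC 2.0 response."""
    req = json.loads(raw)
    result = methods[req["method"]](*req["params"])
    return json.dumps({"jsonrpc": "2.0", "result": result, "id": req["id"]})

methods = {"add": lambda a, b: a + b}

request = make_request("add", [2, 3])
print(request)                           # what the client sends
print(handle_request(request, methods))  # {"jsonrpc": "2.0", "result": 5, "id": 1}
```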