DevOps - kamialie/knowledge_corner GitHub Wiki

Overview

DevOps relies on bodies of knowledge from Lean, Theory of Constraints, the Toyota Production System, resilience engineering, learning organizations, safety culture, human factors, and many others.

Two of Lean’s major tenets are the deeply held beliefs that manufacturing lead time (the time required to convert raw materials into finished goods) is the best predictor of quality, customer satisfaction, and employee happiness, and that one of the best predictors of short lead times is small batch sizes of work.

One key principle of the Agile Manifesto was to “deliver working software frequently, from a couple of weeks to a couple of months, with a preference to the shorter timescale,” emphasizing the desire for small batch sizes and incremental releases instead of large, waterfall releases. Other principles emphasized the need for small, self-motivated teams working in a high-trust management model.

Improvement kata - requires creating structure for the daily, habitual practice of improvement work, because daily practice is what improves outcomes. The constant cycle of establishing desired future states, setting weekly target outcomes, and the continual improvement of daily work is what guided improvement at Toyota.

The DevOps ideal of deployment lead times (from customer request to running in production) of minutes is achieved by continually checking small code changes into a version control repository, performing automated and exploratory testing against them, and deploying them into production. This enables us to have a high degree of confidence that our changes will operate as designed in production and that any problems can be quickly detected and corrected. This is most easily achieved when we have an architecture that is modular, well encapsulated, and loosely coupled, so that small teams are able to work with high degrees of autonomy, with failures being small and contained, and without causing global disruptions.

DevOps characteristics:

two-way collaboration between Development and Operations teams (tooling, knowledge)
strong involvement of devs
operations need some development skills (at least Bash or Python)
developers need operational skills (not only writing code, but also deploying and monitoring alerts)
automation (if it isn't automated it is broken)
ideally self-service for devs, at least deployment (CI/CD pipelines)
GitOps
cultural: hierarchy < process
microservices > monolith

Three ways

Flow

Flow principles enable the fast flow of work from left to right.

Make work visible
Limit Work In Process (WIP)
Reduce batch sizes
Reduce the number of handoffs
Continually identify and elevate constraints
Eliminate hardships and waste in the value stream

Make work visible

Examples are kanban boards and sprint planning boards, ideally spanning the entire work stream from backlog to production. Sections should exist both for work station stages, such as IN PROGRESS and DONE, and for the work stations themselves, such as development, operations, and testing.

Limit Work In Process

Studies have shown that the time to complete even simple tasks, such as sorting geometric shapes, significantly degrades when multitasking. Multitasking can be limited by setting a maximum number of tasks (cards) per section on the kanban board.
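
A per-section card limit can be sketched in a few lines of Python. This is a minimal illustration only: the class names, the column names, and the limit of 2 are all assumptions made for the example, not part of any particular kanban tool.

```python
class WipLimitExceeded(Exception):
    """Raised when a card would push a column past its WIP limit."""


class KanbanBoard:
    def __init__(self, wip_limits):
        # Maps column name -> maximum number of cards (None means unlimited).
        self.wip_limits = dict(wip_limits)
        self.columns = {name: [] for name in self.wip_limits}

    def _check_capacity(self, column):
        limit = self.wip_limits[column]
        if limit is not None and len(self.columns[column]) >= limit:
            raise WipLimitExceeded(f"{column} already holds {limit} cards")

    def add(self, column, card):
        self._check_capacity(column)
        self.columns[column].append(card)

    def move(self, card, source, target):
        if card not in self.columns[source]:
            raise ValueError(f"{card!r} is not in {source}")
        # Check the target before removing, so a rejected move changes nothing.
        self._check_capacity(target)
        self.columns[source].remove(card)
        self.columns[target].append(card)


# Hypothetical board: at most 2 cards may be IN PROGRESS at once.
board = KanbanBoard({"BACKLOG": None, "IN PROGRESS": 2, "DONE": None})
```

With this board, a third card cannot enter IN PROGRESS until one of the first two moves to DONE, which forces finishing before starting.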

"Controlling queue size [WIP] is an extremely powerful management tool, as it is one of the few leading indicators of lead time" - Dominica DeGrandis.

"Stop starting. Start finishing" - David J. Anderson

Reduce batch sizes

Simple newsletter mailing simulation described in Lean Thinking: Banish Waste and Create Wealth in Your Corporation by James P. Womack and Daniel T. Jones.

"Envelope game" (fold, insert, seal, and stamp) from Single Piece Flow: Why mass production isn’t the most efficient way of doing ‘stuff’

Small batch sizes result in less WIP, faster lead times, faster detection of errors, and less rework. The equivalent to single piece flow in the technology value stream is realized with continuous deployment, where each change committed to version control is integrated, tested, and deployed into production.
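
The effect of batch size on the time to the first finished item can be illustrated with a small simulation of the envelope game. This is a simplified sketch under stated assumptions: a single worker, every step taking the same time, and no handling overhead between steps. The function name and the one-second step time are hypothetical.

```python
STEPS = ("fold", "insert", "seal", "stamp")


def completion_times(n_items, batch_size, seconds_per_step=1):
    """Return the clock time at which each envelope is finished.

    A single worker performs each step on every envelope in the current
    batch before moving on to the next step.
    """
    clock = 0
    finished = []
    for start in range(0, n_items, batch_size):
        batch = range(start, min(start + batch_size, n_items))
        for step_index, _step in enumerate(STEPS):
            for _item in batch:
                clock += seconds_per_step
                # An envelope is done once the last step has been applied.
                if step_index == len(STEPS) - 1:
                    finished.append(clock)
    return finished


large_batch = completion_times(10, batch_size=10)
single_piece = completion_times(10, batch_size=1)
print(large_batch[0], single_piece[0])  # first finished envelope: 31 vs 4
```

In this simplified model both strategies finish the last envelope at the same time, because handling overhead is ignored. The difference is that single piece flow produces the first finished envelope after 4 seconds instead of 31, so an error in any step surfaces far earlier.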

Reduce the number of handoffs

Each additional handoff in the value stream adds queueing time and knowledge transfer problems; with enough handoffs, the work can completely lose the context of its initial goal. This is resolved by automating significant portions of the work or reorganizing teams so that they can deliver value to customers themselves, instead of having to be constantly dependent on others.

Continually identify and elevate constraints

Typical DevOps constraint progression:

environment creation - create on demand, self-service
code deployment - automate as much as possible, ideally self-service by any developer
test setup and run - automate runs, parallelize execution (to keep up with code base)
overly tightly coupled architecture - move to a loosely coupled architecture that allows safe and autonomous changes
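
As one illustration of the "parallelize execution" step in the list above, independent test suites can be run concurrently with Python's standard `concurrent.futures` module. This is a sketch only: the function name, the suite names, and the stand-in lambdas are assumptions, and real suites would invoke an actual test runner.

```python
from concurrent.futures import ThreadPoolExecutor


def run_suites_in_parallel(suites, max_workers=4):
    """Run each suite callable concurrently and return {name: passed}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run) for name, run in suites.items()}
        # result() blocks until that suite has finished.
        return {name: bool(future.result()) for name, future in futures.items()}


# Hypothetical suites: each callable returns True when the suite passes.
results = run_suites_in_parallel({
    "unit": lambda: True,
    "api": lambda: True,
    "ui": lambda: False,
})
print(results)
```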

Eliminate hardships and waste in the value stream

Categories of waste by "Implementing Lean Software Development: From Concept to Cash":

partially done work - work that is not completed or is sitting in a queue (requirement documents not reviewed, changes waiting for QA review)
extra processes - work that doesn't add value to the customer (documentation not used, reviews or approvals not adding value)
extra features
task switching - people added to multiple projects or streams, requiring them to context switch
waiting (resources)
motion (amount of effort to move info from one work center to another) - people who communicate frequently are not co-located; handoffs
defects - incorrect, missing or unclear info
nonstandard or manual work
heroics

Feedback

Feedback principles enable fast and constant feedback from right to left from all stages of the value stream. This is done by creating a fast, frequent, high-quality flow of information throughout the value stream and the organization.

Complex system challenges:

high degree of interconnectedness of tightly-coupled components
system-level behavior cannot be explained merely in terms of the behavior of the system components (a single person can't see all circumstances and possible outcomes)
doing the same thing twice will not predictably or necessarily lead to the same result

A perfectly safe system is nearly impossible to create, but following these principles makes a system safer:

complex work is managed so that problems in design and operations are revealed
problems are swarmed and solved, resulting in quick construction of new knowledge
new local knowledge is exploited globally throughout the organization
leaders create other leaders who continually grow these types of capabilities

Feedback principles:

See problems as they occur
Swarm and solve problems to build new knowledge
Keep pushing quality closer to the source
Enable optimizing for downstream work centers

See problems as they occur

The goal is to create fast feedback and fast-forward loops wherever work is performed. This includes the creation of automated build, integration, and test processes, so that we can immediately detect when a change has been introduced that takes us out of a correctly functioning and deployable state.
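
A fail-fast check of this kind can be sketched as a list of stages that stops at the first failure. The stage names and the dictionary describing a change are hypothetical; a real pipeline would call build and test tools instead of reading flags.

```python
def run_pipeline(stages, change):
    """Run stages in order and stop at the first one that fails."""
    for name, check in stages:
        if not check(change):
            # Report immediately which stage took us out of a deployable state.
            return {"deployable": False, "failed_stage": name}
    return {"deployable": True, "failed_stage": None}


# Hypothetical stages; each check returns True when the stage passes.
stages = [
    ("build", lambda change: change["compiles"]),
    ("unit tests", lambda change: change["unit_tests_pass"]),
    ("integration tests", lambda change: change["integration_tests_pass"]),
]

broken_change = {
    "compiles": True,
    "unit_tests_pass": False,
    "integration_tests_pass": True,
}
print(run_pipeline(stages, broken_change))
```

Because the run stops at the unit test stage, the author of the change learns about the problem without waiting for the slower integration tests.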

Pervasive telemetry also shows how all system components are operating in the production environment, so that it can be quickly detected when they are not operating as expected. Telemetry also allows us to measure whether intended goals are achieved and, ideally, is radiated to the entire value stream so that teams can see how their actions affect other portions of the system as a whole.
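
One simple way to detect that a component is "not operating as expected" is to compare the latest measurement with recent history. The sketch below flags a value that lies more than a chosen number of standard deviations from the mean; the function name, the threshold of 3, and the latency figures are assumptions for illustration, and real telemetry systems often use more robust methods.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, max_deviations=3.0):
    """Flag `latest` if it is far from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) > max_deviations * sigma


# Hypothetical response times in milliseconds.
recent_latency_ms = [100, 102, 98, 101, 99]
print(is_anomalous(recent_latency_ms, 101))
print(is_anomalous(recent_latency_ms, 250))
```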

Swarm and solve problems to build new knowledge

Andon cord in Toyota - hangs above every work center and is pulled by a worker or manager when a problem occurs. The team leader is immediately alerted and works on the problem. If the problem isn't fixed within a specified amount of time, production is halted and the entire organization is mobilized to assist in fixing the problem.
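
The escalation rule described above can be written down as a small decision function. This is only a sketch: the function name, the returned labels, and the 5-minute threshold in the usage line are hypothetical, since the source only says "a specified amount of time".

```python
def andon_response(minutes_since_pull, problem_fixed, escalation_minutes):
    """Decide what happens after the andon cord has been pulled."""
    if problem_fixed:
        return "resume work"
    if minutes_since_pull < escalation_minutes:
        return "team leader works on the problem"
    # The time limit has passed without a fix, so the whole line stops.
    return "halt production and swarm"


print(andon_response(minutes_since_pull=6, problem_fixed=False, escalation_minutes=5))
```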

Swarming is necessary for the following reasons:

prevents the problem from moving downstream (where the cost and effort to repair it are higher)
prevents the work center from starting new work (which would most likely introduce new errors)
if problem isn't addressed, it can potentially occur again on the next operation

Even though in swarming practice a local problem can disrupt operations globally, it prevents the loss of critical information due to fading memories or changing circumstances. It also enables learning. Introduction of new work is also prevented, which enables continuous integration and deployment.

Keep pushing quality closer to the source

Everyone in the value stream should be able to find and fix problems in their area as part of their daily work. This way, quality and safety responsibilities and decision-making are pushed to where the work is performed, instead of resting with distant executives. Ideally, developers should be able to test their code and even deploy it to production.

Enable optimizing for downstream work centers

Lean defines 2 types of customers: the external customer (the one that is likely to pay for the service) and the internal customer (who receives and processes the work immediately after us). According to lean, the most important customer is next downstream.

Operational non-functional requirements (architecture, performance, stability, testability, configurability, security, etc) are prioritized as highly as user features.

Continual learning and Experimentation

Here are the principles that enable constant creation of individual learning, which is then turned into team and organizational knowledge.

Enabling organizational learning and a safety culture
Institutionalize the improvement of daily work
Transform local discoveries into global improvements
Inject resilience patterns into our daily work
Leaders reinforce a learning culture

Enabling organizational learning and a safety culture

Failures should result in reflection and genuine inquiry. Seeking human error and punishment often leads to a culture of fear, which in turn results in problems being hidden. The general reaction to accidents should be to look for system redesigns that prevent future accidents (post-mortems), thus enabling organizational learning.

Institutionalize the improvement of daily work

Improvements are necessary because, due to chaos and entropy, processes degrade over time when they are not actively improved.

Daily work is improved by explicitly reserving time to pay down technical debt, fix defects, and refactor and improve problematic areas of code and environments. This is done by reserving cycles in each development interval, or by scheduling kaizen blitzes, which are periods when engineers self-organize into teams to work on fixing any problem they want. The result of these practices is that everyone finds and fixes problems in their area of control, all the time, as part of their daily work.

Initially blameless post-mortems can be performed for customer-impacting incidents. Over time, they can also be performed for lesser team-impacting incidents and near misses as well.

Transform local discoveries into global improvements

Global knowledge can be created by making reports (such as post-mortems) globally searchable for other teams facing similar problems, and by creating shared source code repositories with code, libraries, and configurations that embody the collective knowledge of the entire organization.

Inject resilience patterns into our daily work

Constant experimentation (introducing tension) improves not only performance, but resilience as well, because the organization is always in a state of tension and change.

Antifragility - process of applying stress to increase resilience.

In the technology value stream, tension can be introduced into systems by seeking to always reduce deployment lead times, increase test coverage, decrease test execution times, and even by re-architecting if necessary to increase developer productivity or increase reliability.

Game Day exercises are rehearsals of large scale failures, such as data center shut down. Chaos Monkey (Netflix) - randomly kills processes and compute servers in production.
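
The idea of random failure injection can be sketched as follows. This is not how Netflix's Chaos Monkey is implemented; it is a toy model in which the function names, the instance names, the kill probability, and the health threshold are all assumptions. A seeded random generator is passed in so that a run can be reproduced.

```python
import random


def pick_victims(instances, kill_probability, rng):
    """Return the instances selected for termination in this round."""
    return [name for name in instances if rng.random() < kill_probability]


def survives(instances, victims, min_healthy):
    """True if enough instances remain after the victims are terminated."""
    remaining = [name for name in instances if name not in victims]
    return len(remaining) >= min_healthy


# Hypothetical fleet of four instances.
fleet = ["web-1", "web-2", "web-3", "web-4"]
victims = pick_victims(fleet, kill_probability=0.25, rng=random.Random(42))
print(victims, survives(fleet, victims, min_healthy=2))
```

Running rounds like this regularly turns "can the service lose an instance?" from an assumption into something that is tested continuously.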

Leaders reinforce a learning culture

The leader's role is not to "make all the right decisions" (setting objectives, allocating resources, establishing incentives, setting the emotional tone, etc.), but rather to create the conditions in which their team can discover greatness in daily work. Leaders and workers depend on each other: leaders are not close enough to the work, while workers have neither the broader organizational context nor the authority to make decisions outside their area.

Strategic goals create iterative, shorter term goals, which are cascaded and executed by establishing target conditions at the value stream or work center level (mirrors scientific approach).

Iterative method and scientific approach.

References

"Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results" (by Mike Rother)