Converger battle plan

Disclaimer: this is a rough draft of a bunch of ideas, and not something anyone has actually committed to

Introduction

Goal: create a clear plan forward for implementing convergence. Core idea: incremental change, mitigating risk.

Phases

Worker in terms of convergence

As explained in the sprint planning RCv3 recap and demonstrated by the RCv3 implementation effort: take pieces of worker code and replace them with their Step counterparts.

  • Desired outcome: validate individual Step components & related pure_http code. (internal)
  • Introduces: Step implementations and related effect code
  • Deprecates: old side-effecty treq worker code

This does not have to be done in its entirety to be useful. Alternatively, we can discover some of these components (and have already started discovering some of them) by manually inspecting the old code's test suite.
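
To make the idea concrete, here is a minimal, self-contained sketch of a Step that describes an HTTP request as plain data instead of performing it; the names (Request, CreateServer, as_request) are illustrative and do not necessarily match the actual classes in otter's convergence or pure_http code.

```python
# Illustrative sketch only: the real otter Step/pure_http interfaces may differ.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """A pure description of an HTTP call; no I/O happens here."""
    service: str
    method: str
    path: str
    data: Optional[dict] = None


@dataclass(frozen=True)
class CreateServer:
    """A Step: one unit of work the converger may decide to perform."""
    launch_config: dict

    def as_request(self) -> Request:
        # The old worker code would call treq directly at this point; a Step
        # only *describes* the request, which makes it trivially testable.
        return Request(
            service="nova",
            method="POST",
            path="/servers",
            data={"server": self.launch_config},
        )


# Testing a Step becomes a pure assertion, with no HTTP mocking required.
step = CreateServer(launch_config={"flavorRef": "performance1-1"})
assert step.as_request().method == "POST"
```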

Risk | Mitigation
--- | ---
API incomplete | One step at a time; discover as we use it
Bugs in new code | One step at a time; fast rollback

Convergence cycle as worker

  • Desired outcome: validate convergence cycle logic (internal)
  • Introduces: convergence cycle/computation logic
  • Deprecates: otter.worker.launch_server_v1 and related code
Risk | Mitigation
--- | ---
Bugs in new code | Fast rollback
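
To illustrate the shape of the cycle/computation logic this phase introduces, here is a highly simplified sketch: given a desired capacity and the servers that actually exist, decide which steps to take. All names are illustrative, and the real logic also has to handle building/error states, load balancers, and so on.

```python
# Illustrative sketch of the convergence computation; not the actual otter code.
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CreateServer:
    launch_config: dict


@dataclass(frozen=True)
class DeleteServer:
    server_id: str


def converge(desired: int, active_server_ids: List[str], launch_config: dict):
    """Return the steps that move the actual state toward the desired state."""
    steps = []
    if len(active_server_ids) < desired:
        steps.extend(CreateServer(launch_config)
                     for _ in range(desired - len(active_server_ids)))
    elif len(active_server_ids) > desired:
        # Pick the excess servers to remove; a real implementation would
        # prefer e.g. the oldest servers or those in an error state.
        steps.extend(DeleteServer(sid) for sid in active_server_ids[desired:])
    return steps


# Example: desired capacity 3, one active server -> two CreateServer steps.
print(converge(3, ["srv-1"], {"flavorRef": "performance1-1"}))
```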

Convergence feedback loop

  • Desired outcome: actual convergence! (external)
  • Introduces: observer
  • Deprecates: nothing
Risk | Mitigation
--- | ---
Un(der)tested behavior, e.g. detach | Work on functional test suite first
Too many converge calls: bugs, bad perf | Test this manually in previous stage
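
The observer is what closes the loop: it periodically compares observed reality against the desired state and re-triggers convergence when they drift apart (e.g. after a detach). The sketch below is purely illustrative; in otter this would be driven by Twisted timers rather than a blocking loop, and the helper functions are placeholders.

```python
# Illustrative sketch of the feedback loop; function names are placeholders.
import time


def observe_active_servers(group_id):
    """Placeholder: list the group's active servers (from Nova or a cache)."""
    raise NotImplementedError


def trigger_convergence(group_id):
    """Placeholder: kick off a convergence cycle for the group."""
    raise NotImplementedError


def observer_loop(group_id, desired, interval=30):
    while True:
        active = observe_active_servers(group_id)
        if len(active) != desired:
            # Reality drifted from the desired state (e.g. a server was
            # detached or deleted out-of-band): converge back toward it.
            trigger_convergence(group_id)
        time.sleep(interval)
```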

Post convergence projects

Several semi-orthogonal projects. Some of these can happen earlier, but that should only be done for a very compelling reason, since it gets in the way of getting convergence done.

This is most likely an incomplete list.

Caching

Up until this point, all information used to compute the appropriate convergence cycle comes directly from the source (e.g., Nova). For a variety of reasons (out of scope for this document), we may want to cache some of that information instead.

Before implementing this feature for any of these dependent systems, we should have a clear definition of why we’re doing this, and a metric to show if it’s actually helping. How & why: different discussion.
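
As a rough illustration of what such a cache might look like, and of the kind of metric (hit rate) we would need to judge whether it actually helps, here is a small TTL cache sketch; the TTL, storage, and invalidation strategy are all open questions, and nothing here reflects a decided design.

```python
# Illustrative TTL cache in front of a source like Nova; not a decided design.
import time


class CachedListing:
    def __init__(self, fetch, ttl=10.0):
        self._fetch = fetch        # function that hits the real source (e.g. Nova)
        self._ttl = ttl
        self._value = None
        self._fetched_at = 0.0
        self.hits = 0              # counters: the metric for "is this helping?"
        self.misses = 0

    def get(self):
        if self._value is not None and time.time() - self._fetched_at < self._ttl:
            self.hits += 1
            return self._value
        self.misses += 1
        self._value = self._fetch()
        self._fetched_at = time.time()
        return self._value

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```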

Throttling/limiting scaling

Because this means creating fewer servers than are actually required, this can only be done once the observer has been put into production; it can’t be done while the convergence cycle is acting as the worker replacement. How & why: different discussion.
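
A minimal sketch of the throttling idea, assuming a per-cycle cap on server creations and relying on later cycles of the feedback loop to make up the difference; the cap and the function name are illustrative.

```python
# Illustrative throttle: cap creations per convergence cycle.
def throttled_create_count(desired, active, max_creates_per_cycle=10):
    """How many servers to create this cycle, at most max_creates_per_cycle."""
    deficit = max(desired - active, 0)
    return min(deficit, max_creates_per_cycle)


# Example: 500 desired, 50 active -> create only 10 now; the observer notices
# the remaining deficit and the next cycle creates more.
assert throttled_create_count(500, 50) == 10
```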

Desired capacity rules

Currently, desired capacity is stated by policies. This works, but makes it difficult for multiple sources of desired capacity to cooperate.

User story: User is an online retailer, relying on Rackspace Cloud Monitoring for automatic load-based scaling through otter. It’s a few days before Black Friday/Cyber Monday: they know they’re going to want to provision a ton of servers, so they execute a policy to set capacity to 500, even though right now they only need 50. Cloud Monitoring immediately notices that the scaling group is massively overprovisioned, so tells otter to take the hatchet to it. User is very unhappy that the 500 servers they asked for go away.

This is essentially the behavior we have today, e.g. a 3rd-party service with a webhook, a scheduled event, and a user action can all interact with each other in unforeseen ways. However, this issue is exacerbated by introducing e.g. Cloud Intelligence and Cloud Monitoring support (see the example above), together with the convergence feedback loop amplifying the voices of those tools.

At this point, this is more a product question than an engineering question. Is it acceptable to just tell users to disable potentially conflicting desired capacity sources temporarily? In the short term, that’s probably what is going to need to happen, but in the long term, we may want a more pleasant answer to this.

See also: logic programming desired capacity spike by lvh a few months ago. Logic programming (and even that spike in particular) addresses this issue, but that doesn’t mean it necessarily has to be addressed, or addressed in that particular way.

Cloud Monitoring integration

Clearly desirable, and now, thanks to convergence, it can be implemented in an otter-pull (as opposed to CM-push) fashion.

Converging on a cleaner code base

The repo has plenty of stale code already, and convergence will deprecate even more. This gives us an opportunity to clean house. Fortunately, only our REST API is public, so we can restructure with impunity.

One way to do this is to run the functional tests under a code coverage tool, and simply delete any non-covered code. Unfortunately we can’t actually do this, because our functional test suite is incomplete.

However, we can still use this as a tool to find code we should do something with. We can take the coverage report and then, through manual inspection, decide for each piece of uncovered code whether we should:

  • delete it, or
  • write functional tests for it.

Ideally, we end up in a situation where we really can just pipe the code coverage report to git rm, but we don't have to reach that point for a garbage collection exercise to be valuable.
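
As a sketch of how this garbage collection exercise could be tooled, assuming a coverage.py JSON report (`coverage json` writes coverage.json), the snippet below lists the files the functional suite never touches; each of those is a candidate for either deletion or new functional tests. Paths and thresholds are illustrative.

```python
# Illustrative: list files with zero coverage from a coverage.py JSON report.
import json


def uncovered_files(report_path="coverage.json"):
    with open(report_path) as f:
        report = json.load(f)
    return sorted(
        path
        for path, data in report["files"].items()
        if data["summary"]["percent_covered"] == 0
    )


if __name__ == "__main__":
    for path in uncovered_files():
        # Manual inspection decides: delete it, or write functional tests for it.
        print(path)
```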

During the transition to convergence, there are several “gaps” where we’re waiting to see if a given subsystem performs as expected. Since there are some

Open questions

Fast rollback

Can we roll back faster if we allow both code paths, and pick which one to use based on an environment variable? Does our deployment strategy allow us to skip doing an entire deploy in that case?
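
A minimal sketch of what such a switch could look like, assuming an environment variable (the name OTTER_USE_CONVERGENCE and the function names are made up here): both code paths stay importable, and rollback becomes a config change plus a restart rather than a redeploy.

```python
# Illustrative feature switch between the old and new code paths.
import os


def launch_servers(group, count):
    if os.environ.get("OTTER_USE_CONVERGENCE", "false").lower() == "true":
        return converge_launch(group, count)   # new convergence-based path
    return worker_launch(group, count)         # old worker path, kept for rollback


def converge_launch(group, count):
    raise NotImplementedError("new code path")


def worker_launch(group, count):
    raise NotImplementedError("old code path")
```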
