Converger battle plan
Disclaimer: this is a rough draft of a bunch of ideas, and not something anyone has actually committed to.
Goal: create a clear plan forward for implementing convergence. Core idea: incremental change, mitigating risk.
As explained in the sprint planning RCv3 recap and demonstrated by the RCv3 implementation effort: take pieces of worker code and replace them with their `Step` counterpart.
- Desired outcome: validate individual `Step` components & related `pure_http` code. (internal)
- Introduces: `Step` implementations and related `effect` code
- Deprecates: old side-effecty `treq` worker code
This does not have to be done entirely for it to be useful. Alternatively, we can discover some of these gaps (and have already started discovering some) by manual inspection of the old code’s test suite.
Risk | Mitigation |
---|---|
API incomplete | One step at a time, discover as we use it |
Bugs in new code | One step at a time, fast rollback |
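For concreteness, here is a minimal sketch of the kind of `Step` this stage introduces, written in the `effect` style: a step is a pure description of work, and its `as_effect` method returns an intent to be performed by a dispatcher elsewhere. The `Request` intent and its field names below are illustrative assumptions, not the actual otter/`pure_http` API.

```python
# Hypothetical sketch only; `Request` here is a stand-in intent, not the
# real pure_http request intent.
from collections import namedtuple

from effect import Effect  # the `effect` library referenced above

# A pure description of an HTTP call, to be carried out by a dispatcher.
Request = namedtuple('Request', ['method', 'path', 'data'])


class DeleteServer(namedtuple('DeleteServer', ['server_id'])):
    """One convergence step: delete a single Nova server."""

    def as_effect(self):
        # No I/O happens here; we only return a description of the request,
        # which makes the step trivial to unit-test.
        return Effect(Request(method='DELETE',
                              path='servers/{0}'.format(self.server_id),
                              data=None))
```

Because a step only returns a description, the side effects that the old `treq` worker code performs directly can instead be tested with simple equality assertions on intents, without mocking HTTP.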
- Desired outcome: validate convergence cycle logic (internal)
- Introduces: convergence cycle/computation logic
- Deprecates: `otter.worker.launch_server_v1` and related code
Risk | Mitigation |
---|---|
Bugs in new code | Fast rollback |
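As a rough illustration of what the “convergence cycle/computation logic” amounts to, the sketch below compares a group’s desired capacity against the servers actually observed and returns a list of steps. The names `converge`, `CreateServer`, and `DeleteServer` are assumptions for this sketch, not the real otter interfaces.

```python
# Illustrative only: the pure half of a convergence cycle. Because it is a
# pure function of its inputs, it can be tested exhaustively without Nova.
from collections import namedtuple

CreateServer = namedtuple('CreateServer', ['launch_config'])
DeleteServer = namedtuple('DeleteServer', ['server_id'])


def converge(desired_capacity, servers, launch_config):
    """Return the steps that move `servers` toward `desired_capacity`.

    `servers` is a list of objects with an `id` attribute, as reported by
    the server-listing code.
    """
    delta = desired_capacity - len(servers)
    if delta > 0:
        return [CreateServer(launch_config) for _ in range(delta)]
    if delta < 0:
        # Naive policy for the sketch: scale down from the end of the list.
        return [DeleteServer(s.id) for s in servers[delta:]]
    return []
```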
- Desired outcome: actual convergence! (external)
- Introduces: observer
- Deprecates: nothing
Risk | Mitigation |
---|---|
Un(der)tested behavior, e.g. detach | Work on functional test suite first |
Too many converge calls: bugs, bad perf | Test this manually in previous stage |
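To show where the “too many converge calls” risk in the table above comes from, here is a deliberately naive, hypothetical sketch of an observer: something that periodically checks each group for divergence and triggers a convergence cycle when it finds one. Real code would be event-driven and rate-limited; every name here is an assumption.

```python
# Hypothetical observer loop; in otter this would be event-driven and
# integrated with the existing scheduling machinery, not a blocking loop.
import time


def observe(get_groups, has_diverged, trigger_convergence, interval=30):
    """Poll every scaling group and converge the ones that have diverged."""
    while True:
        for group in get_groups():
            if has_diverged(group):
                # Without de-duplication or back-off, a flapping group can
                # generate a flood of converge calls -- hence the risk above.
                trigger_convergence(group)
        time.sleep(interval)
```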
What follows are several semi-orthogonal projects. Some of these can happen earlier, but that should only be done for a very compelling reason, since it gets in the way of getting convergence done.
This is most likely an incomplete list.
Up until this point, all information used to compute the appropriate convergence cycle comes directly from the source (e.g., Nova). For a variety of reasons (out of scope for this document).
Before implementing this feature for any of these dependent systems, we should have a clear definition of why we’re doing this, and a metric to show if it’s actually helping. How & why: different discussion.
Because this means creating fewer servers than are actually required, this can only be done once the observer has been put into production; it can’t be done while the convergence cycle is acting as the worker replacement. How & why: different discussion.
Currently, desired capacity is set by executing policies. This works, but makes it difficult for different sources of desired capacity to cooperate.
User story: User is an online retailer, relying on Rackspace Cloud Monitoring for automatic load-based scaling through otter. It’s a few days before Black Friday/Cyber Monday: they know they’re going to want to provision a ton of servers, so they execute a policy to set capacity to 500, even though right now they only need 50. Cloud Monitoring immediately notices that the scaling group is massively overprovisioned, so it tells otter to take the hatchet to it. User is very unhappy that the 500 servers they asked for go away.
This is essentially the behavior we have today: e.g. a third-party service with a webhook, a scheduled event, and a user action can all interact with each other in unforeseen ways. However, this issue is exacerbated by introducing e.g. Cloud Intelligence and Cloud Monitoring support (see the example above), together with the convergence feedback loop amplifying the voices of those tools.
At this point, this is more a product question than an engineering question. Is it acceptable to just tell users to disable potentially conflicting desired capacity sources temporarily? In the short term, that’s probably what is going to need to happen, but in the long term, we may want a more pleasant answer to this.
See also: logic programming desired capacity spike by lvh a few months ago. Logic programming (and even that spike in particular) addresses this issue, but that doesn’t mean it necessarily has to be addressed, or addressed in that particular way.
Clearly desirable, and now, thanks to convergence, this can be implemented in an otter-pull (as opposed to CM-push) fashion.
The repo has plenty of stale code already, and convergence will deprecate even more. This gives us an opportunity to clean house. Fortunately, only our REST API is public, so we can restructure with impunity.
One way to do this is to run the functional tests under a code coverage tool, and simply delete any non-covered code. Unfortunately we can’t actually do this, because our functional test suite is incomplete.
However, we can still use this as a tool to find code we should do something with. We can take the coverage report and then, for each piece of uncovered code, figure out through manual inspection whether we should:
- delete it, or
- write functional tests for it.
Ideally, we end up in a situation where we really can just pipe the code coverage report to `git rm`, but we don’t have to reach that point for a garbage collection exercise to be valuable.
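As a sketch of what that coverage-driven triage could look like (assuming coverage.py 5.0+, which can emit a JSON report via `coverage json` after a functional-test run), the script below lists files with no executed lines as candidates for deletion or for new functional tests. The report filename is an assumption.

```python
# Sketch: list files that the functional-test coverage report shows as
# completely unexecuted. Assumes a `coverage.json` produced by coverage.py.
import json


def uncovered_files(report_path='coverage.json'):
    with open(report_path) as f:
        report = json.load(f)
    return sorted(path for path, data in report['files'].items()
                  if data['summary']['covered_lines'] == 0)


if __name__ == '__main__':
    for path in uncovered_files():
        # Each of these is a candidate to delete or to cover with
        # functional tests -- a human still makes that call.
        print(path)
```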
During the transition to convergence, there are several “gaps” where we’re waiting to see if a given subsystem performs as expected. Since there are some
Can we roll back faster if we allow both code paths, and pick which one to use based on an environment variable? Does our deployment strategy allow us to skip doing an entire deploy in that case?
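One way the question above could be answered is with a simple runtime switch: keep both code paths and pick one per process (or per group) from an environment variable, so rolling back becomes a config change plus restart rather than a redeploy. The variable name and function names below are hypothetical.

```python
# Hypothetical feature-flag dispatch between the old worker path and the
# new convergence path; the env var name is made up for this sketch.
import os


def use_convergence():
    return os.environ.get('OTTER_CONVERGENCE', 'false').lower() in ('1', 'true', 'yes')


def execute_launch_config(group, desired_capacity):
    if use_convergence():
        return run_convergence_path(group, desired_capacity)  # new path
    return run_old_worker_path(group, desired_capacity)       # old path


def run_convergence_path(group, desired_capacity):
    raise NotImplementedError("placeholder for the new converger")


def run_old_worker_path(group, desired_capacity):
    raise NotImplementedError("placeholder for the existing worker code")
```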