Convergence Test Plan
Convergence is an Autoscale behavior described in the Convergence Design Document. The release of this feature changes the expected behavior of Autoscale with respect to how it tracks the expected number of servers in a scaling group and how it recovers from errors. This document describes the test plan that will be used to ensure the quality of the Autoscale Convergence feature.
Functional Tests
Autoscale has an existing suite of functional API tests that thoroughly exercise the Autoscale API. The functional tests are written in Python using the OpenCafe/CloudCafe/autoscale_cloudcafe technology stack and are deployed using [autoscaling-chef]. The tests are all automated and can be run either at the command line or using Jenkins. By default, the functional tests run against Mimic to mock back-end functionality. For further details, see the generalized Autoscale Test Plan [TBC].
As the Convergence workflow is developed, applicable functional tests are tagged to indicate their compliance with convergence. Tests that have been verified to pass are tagged “convergence='yes'” and tests that produce errors are tagged “convergence='error'”. Once development has been completed, most tests should have been examined. At that time, all tests will be reviewed for applicability to convergence and any remaining tests will be categorized.
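For illustration, the tagging looks roughly like the sketch below. The `tags` decorator is OpenCafe's; the fixture and test names are hypothetical stand-ins for the real autoscale_cloudcafe tests.

```python
# A minimal sketch of the tagging convention. The fixture and test names
# below are illustrative stand-ins, not the actual autoscale test classes.
from cafe.drivers.unittest.decorators import tags
from cafe.drivers.unittest.fixtures import BaseTestFixture


class ExamplePolicyTests(BaseTestFixture):

    @tags(convergence='yes')
    def test_execute_change_policy(self):
        """Verified to pass against the convergence workflow."""

    @tags(convergence='error')
    def test_execute_webhook_policy(self):
        """Currently produces errors under convergence."""
```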
Integration Tests
The Convergence integration tests fall into two general categories: scenario tests and error injection tests. Scenario tests will execute representative workflows and verify that Autoscale produces the expected behavior. Error injection tests will use tools such as Mimic to simulate potential error conditions that Autoscale is expected to handle. The integration tests will be written using the Twisted Python testing framework, Trial. This was chosen due to the benefits of code reuse from Autoscale, which is written in Twisted, and to take advantage of the asynchronous features of Twisted.
Since there are nearly infinite possible scenario tests, a generative testing tool will be used to produce varied scenarios. Specific scenarios identified as interesting or non-conforming upon result examination will be ported into the automated regression test suite in Trial.
Generative Testing
The generative testing tool currently being developed is written in Clojure and is located at https://github.com/lvh/nordschleife. Generative testing works by varying identified parameters to create test scenarios. Below is a list of the parameters that will be varied in the generated scenarios; a sketch of one generated parameter set appears after the lists. (Note: Not all variable parameters are implemented at this time)
Launch Config
- load-balancers: A list of cloud load balancers
- networks: a list of networks, always including service-net
- server-image-ref ^(0)
- server-flavor-ref ^(0)
- server-disk-config ^(0)
^(0): The interactions of these parameters are not fully supported in Mimic and would only be informative when run against production hardware
Group Config
- cooldown
- min-entities
- max-entities
Policy
- changePercent
- cooldown
- change
- desiredCapacity
Note: Generative testing does not support scheduled policies
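For illustration, one generated combination of the parameters above might look like the following sketch. The dictionary layout follows the public Autoscale API request bodies, but every ID and value shown is a placeholder.

```python
# One illustrative parameter combination a generated scenario might use.
# All IDs and values are placeholders, not real resources.
launch_config = {
    "type": "launch_server",
    "args": {
        "server": {
            "imageRef": "example-image-uuid",      # server-image-ref ^(0)
            "flavorRef": "general1-1",             # server-flavor-ref ^(0)
            "OS-DCF:diskConfig": "AUTO",           # server-disk-config ^(0)
            "networks": [
                # always includes service-net
                {"uuid": "11111111-1111-1111-1111-111111111111"},
            ],
        },
        "loadBalancers": [
            {"loadBalancerId": 9001, "port": 80},  # placeholder CLB
        ],
    },
}

group_config = {
    "name": "generated-group-001",
    "cooldown": 60,
    "minEntities": 1,
    "maxEntities": 10,
}

policies = [
    {"name": "scale up by 2", "cooldown": 30, "change": 2, "type": "webhook"},
    {"name": "scale down 50%", "cooldown": 30, "changePercent": -50, "type": "webhook"},
    {"name": "scale to 5", "cooldown": 0, "desiredCapacity": 5, "type": "webhook"},
]
```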
Events
Each generated scenario will be composed of a sequence of the following event options:
- Scale Up
- Scale Down
- Scale To
- Delete Server (Stretch)
- Update Launch Config
- Update Group Config
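A generated scenario is then just an ordered sequence of such events applied to a generated group. The sketch below only illustrates the general shape of a sequence; it is not nordschleife's actual data model.

```python
# Illustrative shape of one generated event sequence; the event names mirror
# the list above, and the parameters reuse the policy fields shown earlier.
scenario_events = [
    ("scale-up", {"change": 3}),
    ("update-group-config", {"maxEntities": 6}),
    ("scale-down", {"changePercent": -50}),
    ("delete-server", {"count": 1, "out_of_band": True}),  # stretch goal
    ("scale-to", {"desiredCapacity": 2}),
]
```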
Test Cases
- Scale Up: Given the number of permutations of this test, this will use generative testing to exercise various scaling factors, min/max constraints, policy types, and load balancers
- Create a scaling group with various launch configuration data
- Scale up by X servers
- Validate end group state
- CATC-c001: Basic cases
Covered in cafe
- Scale Down: Given the number of permutations of this test, this will use generative testing to exercise various scaling factors, min/max constraints, and cloud load balancers
- Create a scaling group with various launch configuration data
- Scale down by X servers
- Validate end group state
- CATC-c002: Basic cases
Covered in cafe
- Multiple Scaling Policies: Given the number of permutations of this test, this will use generative testing to exercise various scaling factors, min/max constraints, and cloud load balancers (a capacity-bookkeeping sketch follows this case)
- Create a scaling group with various launch configuration data
- Create X scaling policies, some increasing and some decreasing
- Apply the policies in sequence (with variable amounts of time between policy executions)
- Validate end group state, taking into account cool-down times
- CATC-c003: Basic cases
Covered in cafe
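Validating the end group state for a sequence of policy executions amounts to replaying the policies against the desired capacity, skipping executions suppressed by cooldown and clamping to the min/max constraints. The sketch below shows that bookkeeping in simplified form: the percent-rounding rule and the use of only the group cooldown are assumptions, not the production convergence logic.

```python
from math import ceil


def expected_capacity(executions, min_entities, max_entities, group_cooldown):
    """Replay (timestamp, policy) pairs and return the expected desired capacity.

    Policies use the API field names (change / changePercent / desiredCapacity).
    Simplified sketch: only the group cooldown is modelled, and percent changes
    are assumed to round away from zero.
    """
    desired = min_entities
    last_execution = None
    for when, policy in sorted(executions, key=lambda e: e[0]):
        # Group cooldown: executions that arrive too soon are rejected.
        if last_execution is not None and when - last_execution < group_cooldown:
            continue
        if "change" in policy:
            desired += policy["change"]
        elif "changePercent" in policy:
            delta = desired * policy["changePercent"] / 100.0
            step = int(ceil(abs(delta)))
            desired += step if delta >= 0 else -step
        elif "desiredCapacity" in policy:
            desired = policy["desiredCapacity"]
        # Clamp to the group constraints.
        desired = max(min_entities, min(max_entities, desired))
        last_execution = when
    return desired


# e.g. expected_capacity([(0, {"change": 2}), (10, {"changePercent": -50})],
#                        min_entities=1, max_entities=10, group_cooldown=60)
```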
- Out of Band Delete (OOBD): Scale after delete below min. This case validates that when out of band deletions bring the number of active servers below the group min, the servers are replaced up to the group min when convergence is triggered (a Trial-style sketch follows this case).
- Create a group with N minimum servers
- Delete N/2 servers out of band
- Trigger convergence
- Validate that convergence triggers and the group reaches the expected end state of N servers
- CATC-004-a: Without CLB
test_reaction_to_oob_server_deletion_below_min
- CATC-004-b: With CLB
TBD
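A Trial test for this case could take roughly the following shape. The helper names (`ScalingGroup`, `NovaServers`, `trigger_convergence`, `wait_for_active_count`) are hypothetical stand-ins, not the actual otter integration-test library.

```python
from twisted.internet.defer import inlineCallbacks
from twisted.trial.unittest import TestCase

# Hypothetical helpers standing in for the real integration-test library.
from helpers import (ScalingGroup, NovaServers, trigger_convergence,
                     wait_for_active_count)


class OutOfBandDeleteTests(TestCase):
    """CATC-004-a: servers deleted out of band below min are replaced."""

    @inlineCallbacks
    def test_reaction_to_oob_server_deletion_below_min(self):
        n = 4
        # Create a group whose minimum requires n servers and wait for them.
        group = yield ScalingGroup.create(min_entities=n)
        servers = yield wait_for_active_count(group, n, timeout=600)

        # Delete half of the servers directly through Nova (out of band).
        yield NovaServers.delete(servers[:n // 2])

        # Trigger convergence and verify the group heals back to n servers.
        yield trigger_convergence(group)
        yield wait_for_active_count(group, n, timeout=600)
```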
- OOBD: Scale up after out of band delete with expected total over group min. When out of band deletions bring the number of active servers below the group min, the servers are replaced in addition to adding the requested servers when a policy scales to over the group min.
- Create a group with min N servers
- Scale group to (N + x) servers
- Delete some of the servers out of band
- Scale up by (y) servers
- Validate end state of (N + x + y) servers
- CATC-005-a: Without CLB
test_reaction_to_oob_deletion_then_scale_up
- CATC-005-b: With CLB
TBD
- OOBD: Scale down after out of band delete, non-constrained. When scaling down after an out of band delete (OOBD) that is not constrained by the group max or min, the group stabilizes at a number of servers consistent with scaling from the active capacity before the OOBD. (i.e. The final result should be as if the OOBD never happened.)
- Create a group with min N servers
- Set the group to a desired capacity of x servers
- Delete z of the servers out of band
- Scale down by (y) servers
- Validate end state of (x - y)
- CATC-006-a: Where z < y, without CLB
test_scale_down_after_oobd_non_constrained_z_lessthan_y
- CATC-006-b: Where z > y, without CLB
test_scale_down_after_oobd_non_constrained_z_greaterthan_y
- CATC-006-c: Where z == y, without CLB
test_scale_down_after_oobd_non_constrained_z_equal_y
- CATC-006-d: Where z < y, with CLB
TBD
- CATC-006-e: Where z > y, with CLB
TBD
- CATC-006-f: Where z == y, with CLB
TBD
- OOBD: Scale over max after OOB deletion. Validate that scaling up when already at the max returns a 403 even after an out of band delete (OOBD) has reduced the number of servers below the group max. Even though the policy cannot execute, convergence is triggered and the deleted servers are replaced.
- Create a scaling group
- Scale to MAX servers
- Delete ‘z’ servers OOB
- Attempt to scale up (i.e. beyond max)
- Validate that group returns to max servers and does not overshoot
- CATC-007-a: Attempt to scale above the group max, without CLB
test_scale_up_after_oobd_at_group_max
- CATC-007-b: Attempt to scale below the group min, without CLB
test_scale_down_past_group_min_after_oobd
- CATC-007-c: Attempt to scale above the group max, with CLB
TBD
- CATC-007-d: Attempt to scale below the group min, with CLB
TBD
- OOBD: Confirm that updating the group configuration triggers convergence and fixes an out of band delete (a sketch of the config-update call follows this case)
- Create a group and set to N, where N > min
- Delete a server out of band
- Update the group config (ex. change the group MAX)
- Validate that updating the config triggered convergence back to N servers
- CATC-008-a: without CLB
test_group_config_update_triggers_convergence
- CATC-008-b: with CLB
TBD
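Driving this case through the public API only requires a group-config update such as the sketch below. The endpoint shape follows the published Autoscale API (PUT .../groups/{groupId}/config), but the base URL, token, and values are placeholders, and the real Trial tests would use Twisted's asynchronous HTTP client rather than requests.

```python
import requests

# Placeholders: real tests take these from the test configuration.
ENDPOINT = "http://localhost:9000/v1.0/<tenant_id>"
HEADERS = {"X-Auth-Token": "<token>", "Content-Type": "application/json"}


def update_group_config(group_id, max_entities):
    """Touch the group config (e.g. change maxEntities) to trigger convergence."""
    config = {
        "name": "catc-008-group",
        "cooldown": 60,
        "minEntities": 1,
        "maxEntities": max_entities,
        "metadata": {},
    }
    response = requests.put(
        "{0}/groups/{1}/config".format(ENDPOINT, group_id),
        json=config, headers=HEADERS)
    response.raise_for_status()  # Autoscale returns 204 No Content on success
    return response
```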
- Server Transitioned to Error (STE): STE from Building
- Validate that the expected group state is reached when servers error on build
- Server Timeout on Build
- (ex. use mimic to time out 1/5 servers)
- Validate that the expected group state is reached when servers time out on build
- STE: Scale up after X servers on a group go from Active to Error
- After Error, group fell below min
- After Error, no constraints
- If not for Error, would be over max when scaled up
- STE: Scale down after X servers on a group go from active to Error
- After Error, group fell below min (scale down should actually increase back to min)
- After Error, no constraints
- If not for Error, would be over max
- STE: Update config after servers on group transition to Error
- Create Group with Servers > Min
- Use mimic to cause one to transition from ACTIVE->ERROR
- Update config (Ex. change MAX)
- Validate that errored servers are reaped and replaced
- Repeat tests 4-12 with an associated CLB
Reason: This coverage is documented in each of the individual cases
- OOB at CLB max: (WHITEBOX) Scale to CLB max after OOBD
- Create a scaling group
- Scale to 24 (CLB max -1) servers
- Delete ‘2’ servers OOB
- Attempt to scale up by 1 server (to CLB max)
- Validate that group attempts to return to 25 servers, which will only be possible if the OOBD servers were correctly cleaned off the load balancers before being replaced.
- CATC-015-a: Group max == CLB max
test_scaling_to_clb_max_after_oob_delete_type1
- CATC-015-b: Group max > CLB max
test_scaling_to_clb_max_after_oob_delete_type2
- Metadata: (WHITEBOX) Scale over group max after metadata removal
- Create a scaling group
- Scale to MAX servers
- Remove Autoscale metadata from ‘z’ servers
- Attempt to scale up (i.e. beyond max)
- Validate that group returns to max associated servers and does not overshoot
Reason: Effective duplicate of case 18
- Metadata w/CLB: (WHITEBOX) Scale over group max after metadata removal, where Group MAX < CLB max (NOTE: Autoscale does not take responsibility for incorrect behavior after metadata tampering)
- Create a scaling group
- Scale to MAX servers
- Remove Autoscale metadata from ‘z’ servers
- Attempt to scale up (i.e. beyond max)
- Validate that group returns to max associated servers and does not overshoot (i.e. deleted servers do not become zombies on the group)
- Confirm that other scaling groups on tenant are not impacted
- CATC-017-a: Group max < CLB max
test_scale_over_group_max_after_metadata_removal_reduced_grp_max
- Metadata w/CLB: (WHITEBOX) Scale over group max after metadata removal, where Group MAX == CLB max (25)
- Create a scaling group
- Scale to MAX servers
- Remove Autoscale metadata from ‘z’ servers
- Attempt to scale up (i.e. beyond max)
- Validate that group attempts to return to max associated servers and does not overshoot
- It is expected that a scaling attempt at the group max will return an error and trigger convergence
- CATC-018-a: Group max == CLB max
test_scale_over_group_max_after_metadata_removal
- Scale group with CLB to over 25: validate that the group enters a TBD error state
- Scale group with CLB deleted partway
- Create a group and scale by x
- Delete the CLB or otherwise cause it to return 404
- Attempt to scale up
- Validate that the group enters a TBD error state
- Switch load balancers before scale (NOTE: This scenario has been extensively discussed, and this is the current expected behavior as documented in TBD)
- Create a group with CLB-A
- Scale to X servers
- Modify config to change to CLB-B
- Scale to Y servers
- Confirm that the first X servers are on CLB-A and the next Y on CLB-B
- Scale down to min (0) and confirm that all servers successfully scale down
Reason: As written, this is duplicated in cafe. Related behavior can be found in the LB healing tests.
- Multiple cloud load balancers, one goes into Err/Invalid before scale up, then eventually fixes itself
- Create a group with 2 CLBs & non-min servers
- Use mimic to cause one of the load balancers to go into error
- Attempt to scale up
- Confirm that no progress is being made
- After time ‘t’ less than TBD time to permanent error state, fix the CLB
- Confirm that the scale up completes as expected
- Simulate “Unable to remove Node” from CLB
- Create a group with non-min servers and a CLB (or more)
- Set mimic to return 422 (TBC) error on removal attempt from CLB
- Attempt to scale down
- (Need to confirm expected behavior)
- Simulate periodic Nova 40xs (NOTE: If not periodic, these usually result from user error and can only be fixed by fixing the config)
- For each of the following:
- 403: Not allowed to boot from image
- 403: Access was denied
- 400: Image not active
- 400: Bad Networks Format v. 401: ‘no body’
- Create group
- Scale up
- Mimic returns 40x some percent of the time
- Group eventually reaches expected state
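Error injection for these Nova failure cases would go through Mimic's behavior-injection control plane. The sketch below shows the general shape of such a request; the control endpoint path and the payload fields are assumptions and should be checked against the deployed Mimic version.

```python
import requests

# Assumed Mimic Nova control endpoint; the exact path depends on the Mimic
# deployment and is an assumption here, as is the payload shape.
NOVA_CONTROL = ("http://localhost:8900/mimicking/NovaControlApi/ORD/"
                "<tenant_id>/behaviors/creation")


def inject_create_failure(code, message, server_name_pattern=".*"):
    """Ask Mimic to fail matching server-create requests with ``code``."""
    body = {
        "criteria": [{"server_name": server_name_pattern}],
        "name": "fail",
        "parameters": {"code": code, "message": message},
    }
    return requests.post(NOVA_CONTROL, json=body)


# e.g. simulate "403: Quota exceeded" on scale up:
# inject_create_failure(403, "Quota exceeded for instances")
```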
- Simulate Nova Quota error (403: Quota), then recovery
- Create a group
- Cause mimic to return 403: Quota errors on scale up
- Wait for a time, then fix the quota problem (i.e. Nova now returns normal codes)
- Validate that scale up completes
- Simulate identity errors by failing to authenticate the client x% of the time (NOTE: The case of constant Identity failures is not convergence related)
- Set up Mimic to return identity errors on auth attempts x% of the time
- Create a group
- Create multiple policies on the group
- Execute policies in sequence
- Confirm that group eventually reaches desired state
- Force overshoot condition by altering config (NOTE: There are safeguards to prevent lowering MAX below actual. This might require instrumented injection, or the case may be deemed unneeded)
- Create a scaling group
- Scale to N servers
- Update group config to max less than N
- Expect excess servers to be deleted
Reason: Covered in cafe
- Force undershoot condition by altering config
- Create a scaling group with X min servers
- Scale up by Y servers
- Update group config to min = X + Y - 1
- Validate expected group state
Reason: Covered in cafe
- Simulate accidental overshoot due to Nova 500 without an actual failure
- Create a scaling group
- Scale up by X
- Use mimic to cause Nova to return a 500 but build the server anyway
- Confirm that any excess builds are stopped and the correct number of servers is produced
- Scheduled Policy Interactions (possibly redundant with Cafe tests). Test the following scenarios:
- Scheduled scale up/down policy executed during scale up/down by percent
- Scheduled scale down policy executed during scale up/down by increment
- Scheduled scale up/down policy executed during scale to exactly
Reason: Covered in cafe
- Race Conditions (Optional): Race conditions are difficult to simulate and the previously seen cases are unlikely to occur in the wild. The best way to simulate race conditions requires taking into account which API node receives a given request.
- Scale to max_i + 3, then change config to allow
- Scale to max_i - 3, then change config max_f to max_i - 5
- Execute scheduled policy
Reason: This is not practical to implement
LB Self-Healing Tests (WIP)
- Servers managed by autoscale SHALL be nodes only on the load balancers in their launch config after convergence is triggered.
- 1 group, LB1 in config:
- Server node removed from LB
- Assert: Server is put back
- 1 group, LB1 in config, LB2 not in any autoscale configs:
- Server node moved from LB1 to LB2
- Assert: Server put back on LB1
- Assert: Server removed from LB2
- 1 group, LB1 in config, LB2 not in any autoscale configs:
- Server node added to LB2 (now on both)
- Assert: Server still on LB1
- Assert: Server removed from LB2
- Load balancers used in autoscale configurations MAY have non-autoscale servers on them and Convergence SHALL only clean up servers managed by autoscale.
(This test is largely covered by the others, but it is straightforward and can also be applied to RCV3 so I would advocate including it anyway.)
- 1 group, 1 LB in config, N > 2 non-autoscale servers on LB:
- Servers should be directly created with Nova and added in a way that allows reuse for CLBs and RCV3
- Remove Autoscale server from LB
- Remove 1 non-otter server from LB
- Assert: AS server is put back
- Assert: non-otter server is not put back
- Assert: Remaining non-autoscale servers are untouched
- Switch servers between LB1 and LB2
- Trigger convergence on group 2
- Assert: server2 is removed from LB1
- Assert: server2 is put back on LB2
- Assert: node3 is still on both LB1 and LB2
- Assert: node4 is only on LB1
- Assert: node5 is only on LB2
- Trigger convergence on group 1
- Assert: server1 on LB1
- Assert: server1 not on LB2
- Assert: server2 on LB2
- Assert: node3 on both
- Assert: node4 on LB1
- Assert: node5 on LB2
- 1 group, 2 servers, LB1 on group, LB2 not on group
- Disown 1 server
- Move disowned server to LB2
- Trigger convergence:
- Assert: moved server is not moved back to LB1
- 1 group, n servers, LB1 on group, LB2 not on group
- Disown 1 server
- Add disowned server to LB2 as well as LB1
- Trigger convergence:
- Assert: Disowned server is on both LB1 and LB2
- When convergence is triggered, all LB nodes corresponding to autoscale servers SHALL have an associated port that is included in the autoscale launch configuration the server was created with (i.e. autoscaled servers are only on those ports they’re supposed to be on).
- 2 Scaling groups on the same CLB
- (group 1 on port 80, group 2 on port 8080, min 1 server)
- Wait until active.
- Move group 1’s server from port 80 to port 8080.
- Trigger convergence on both.
- Assert that group 1’s server is only on port 80.
- Assert that group 2’s server is only on port 8080.
(Even if a different autoscale group uses the wrong port, fix a server if its group config does not.)
- 2 Scaling groups on the same CLB
- (group 1 on port 80, group 2 on port 8080, min 1 server)
- Wait until active.
- Add a node with a server from group 1 on port 8080.
- Trigger convergence on both.
- Assert that group 1’s server is only on port 80.
- Assert that group 2’s server is only on port 8080.
- 2 Scaling groups on the same CLB
- Create at least one server on the group (port 80, 8080, 333)
- On the CLB, change server 1's port 80 node to port 666
- Trigger convergence
- Assert that group 1’s server is only on port 80, 8080, and 333
- 1 Scaling group on 1 CLB with other non-autoscale nodes
- Change 1 server’s CLB node status
- from ENABLED to DRAINING
- from weight 1 to 2
- from PRIMARY to SECONDARY
- Trigger convergence
- Assert: all autoscaled servers should be in ENABLED mode, weight 1, PRIMARY
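Many of the assertions above reduce to inspecting the CLB node records for the group's servers. The helper below sketches one way to express that check; the node dictionaries use the standard CLB node attributes (address, port, condition, type, weight), and the function itself is illustrative rather than part of an existing test library.

```python
def assert_nodes_healthy(lb_nodes, autoscale_addresses, expected_port):
    """Check the convergence invariants for autoscaled nodes on one CLB.

    ``lb_nodes`` is a list of CLB node dicts (address, port, condition, type,
    weight); ``autoscale_addresses`` is the set of ServiceNet addresses of the
    servers the group currently owns. Illustrative only.
    """
    for node in lb_nodes:
        if node["address"] not in autoscale_addresses:
            continue  # non-autoscale nodes are left alone
        assert node["port"] == expected_port, node
        assert node["condition"] == "ENABLED", node
        assert node["type"] == "PRIMARY", node
        assert node["weight"] == 1, node

    # Every autoscaled server should be present exactly once.
    seen = [n["address"] for n in lb_nodes if n["address"] in autoscale_addresses]
    assert sorted(seen) == sorted(autoscale_addresses)
```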
Convergence Testing Completion Criteria
- All existing CloudCafe tests have been evaluated for applicability post convergence
- Cafe API tests are updated for convergence as needed (expected to be minimal)
- Define, mark, and automate with Jenkins a minimal set of gating tests
- Include a justification for why the selected tests are sufficient gating (code coverage requirements? check common error cases?)
- Implement all cases in the above test case list that apply to completed and defined Autoscale behavior
- Applicable Autoscale behavior is defined in TBD document
Coverage
- Unit test
- Run coverage of unit tests, Coverage = TBD% (Goal 100%)
- API tests (CloudCafe)
- Coverage = TBD% (not expected to reach 100%; an acceptable threshold will be determined after measuring current coverage, but at least 75%)
- Integration Tests (Trial)
- Coverage = TBD%
Risks and Unknowns
- Certain aspects of Autoscale behavior are not defined
- Examples
- Warning about exceeding the max for CLB
- Rolling updates (or not) based on launch config changes: the current understanding is that no changes are currently rolled out to existing servers. At one point it was thought that CLB changes would be, but CLB config is now stored in server metadata at creation, so compliance is checked against that data rather than against the current config.
- What warning, if any, is given if there are massive auth failures across accounts?
- Dependencies on Autoscale development
- Examples
- Inability to perform complete black box testing
- Since the point of the system is to hide (black-box) any errors from the user, some level of internal knowledge and visibility is required in order to perform integration tests and verify correct handling of external failures.