Stack upgrades & ProcessSync: What happens if apps use a stack the platform no longer supports? - cloudfoundry/cloud_controller_ng GitHub Wiki
If I upgrade CC to a version that drops support for a stack, but some of my running apps still use that stack...
Summary
Diego will be very careful not to cause unexpected app downtime.
Those freshly invalid, old-stack apps will continue to run and be routable, but CC will no longer be able to send updates for them to Diego.
The system will recognize this and refuse to delete any compute resources until it can confirm that they aren't the old-stack apps that it can no longer sync.
What happens to the apps that are still using that unsupported stack?
They continue to exist in CCDB
They continue to exist in BBS as Diego DesiredLRPs
They continue to run on Diego Cells as Diego ActualLRPs (?)
They continue to be routable (?)
They can no longer be updated or created in Diego
Updates and creates will result in the error "no compiler defined for requested stack"
Any change to the process' updated_at will make Diego's DesiredLRP out-of-date
The ProcessSync loop will attempt to update all out-of-date DesiredLRPs
Because the domain is unfresh:
They can be deleted in the CF API, but Diego will not stop running their ActualLRPs (?)
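The out-of-date rule above can be sketched roughly as follows. This is a minimal illustration, not CC's actual code: the `Process` and `DesiredLRP` classes, the `synced_updated_at` field, and the `SUPPORTED_STACKS` set are all hypothetical stand-ins. It assumes the DesiredLRP remembers the process's `updated_at` from the last successful sync.

```python
from dataclasses import dataclass

# Hypothetical stand-ins; none of these names come from CC or Diego.
SUPPORTED_STACKS = {"cflinuxfs3"}


@dataclass
class Process:
    guid: str
    stack: str
    updated_at: float  # CCDB timestamp of the last change


@dataclass
class DesiredLRP:
    process_guid: str
    synced_updated_at: float  # assumed: the updated_at recorded at last sync


def is_out_of_date(process: Process, lrp: DesiredLRP) -> bool:
    """A DesiredLRP is stale once the process changed after the last sync."""
    return process.updated_at != lrp.synced_updated_at


def classify(process: Process, lrp: DesiredLRP) -> str:
    """Return what the sync loop would face for this process."""
    if not is_out_of_date(process, lrp):
        return "in-sync"           # nothing to send, so no error is raised
    if process.stack not in SUPPORTED_STACKS:
        return "update-will-fail"  # "no compiler defined for requested stack"
    return "update-will-succeed"


untouched = classify(Process("a", "cflinuxfs2", 100.0), DesiredLRP("a", 100.0))
touched = classify(Process("b", "cflinuxfs2", 200.0), DesiredLRP("b", 100.0))
supported = classify(Process("c", "cflinuxfs3", 200.0), DesiredLRP("c", 100.0))
```

In this sketch an old-stack app that nobody touches stays "in-sync" and causes no errors. The sync only starts failing once something changes the process's `updated_at`.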
What happens to the sync loop?
It continues to run
In parallel, it continues to sync as many CC processes as possible to Diego as DesiredLRPs
Any app with an unsupported stack will error on update if Diego's DesiredLRP is out-of-date.
Update errors will prevent freshness from being bumped
All errors encountered should be logged by the clock
No destructive action will be taken against LRPs in that domain
Processes with unsupported stacks will continue to run (unless Diego has dropped them during evacuation?)
Processes that have been deleted in CC but exist in Diego will continue to run
Creates and updates of processes with supported stacks will continue to work fine
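The loop behavior described above can be sketched as below. This is an assumption-laden simplification, not the real ProcessSync: `sync_processes`, `update_lrp`, and `bump_freshness` are hypothetical names, and the sketch runs sequentially where the real loop works in parallel. It only shows that errors are collected rather than fatal, and that a single error is enough to skip the freshness bump.

```python
from typing import Callable, Iterable, List, Tuple


class UnsupportedStackError(Exception):
    """Hypothetical error type standing in for the unknown-stack failure."""


def sync_processes(
    process_guids: Iterable[str],
    update_lrp: Callable[[str], None],
    bump_freshness: Callable[[], None],
) -> List[Tuple[str, str]]:
    """Sync every process, collect errors, bump freshness only on a clean pass."""
    errors: List[Tuple[str, str]] = []
    for guid in process_guids:
        try:
            update_lrp(guid)
        except Exception as exc:
            # Record the failure and keep going with the remaining processes.
            errors.append((guid, str(exc)))
    if not errors:
        bump_freshness()
    return errors


updated: List[str] = []
bumps: List[str] = []


def fake_update(guid: str) -> None:
    # Stand-in for sending a DesiredLRP update to Diego.
    if guid == "old-stack-app":
        raise UnsupportedStackError("no compiler defined for requested stack")
    updated.append(guid)


errors = sync_processes(
    ["ok-app", "old-stack-app", "other-app"],
    fake_update,
    lambda: bumps.append("bumped"),
)
```

Here the two healthy apps are still updated, the old-stack app is reported as an error, and `bumps` stays empty, so the domain would be left unfresh.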
What happens to apps that synced successfully?
They can be created, updated, scaled, etc
Because the domain is unfresh:
They can be deleted in the CF API
BUT Diego will not stop running their ActualLRPs
What happens to "mysterious" ActualLRPs that CCDB has no record of?
They cannot be deleted in the CF API
Because the domain is unfresh:
Diego will not stop running their ActualLRPs
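The effect of an unfresh domain on "mysterious" or deleted LRPs can be illustrated with the sketch below. It is a simplified model rather than Diego's convergence code: the `Domain` class, its `fresh_until` field, and `lrps_safe_to_stop` are hypothetical. It assumes freshness expires unless the sync re-bumps it, and that running LRPs without a desired counterpart are only eligible to be stopped while the domain is fresh.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Domain:
    name: str
    fresh_until: float  # assumed: freshness lapses unless it is re-bumped

    def is_fresh(self, now: float) -> bool:
        return now < self.fresh_until


def lrps_safe_to_stop(
    domain: Domain, desired: Set[str], running: Set[str], now: float
) -> List[str]:
    """Return running LRPs with no desired counterpart, if the domain is fresh."""
    if not domain.is_fresh(now):
        # Desired state may be incomplete, so take no destructive action.
        return []
    return sorted(running - desired)


desired = {"app-1"}
running = {"app-1", "mystery-lrp"}

stale = lrps_safe_to_stop(Domain("cf-apps", fresh_until=50.0), desired, running, now=100.0)
fresh = lrps_safe_to_stop(Domain("cf-apps", fresh_until=150.0), desired, running, now=100.0)
```

With the unfresh domain nothing is eligible to be stopped, which matches the behavior listed above. Once the domain is fresh again, only the LRP with no desired record becomes eligible.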
Open Questions
Is this the best we can do to handle this class of failure?
Should we tolerate unknown stack errors for bumping freshness?
What does Diego do if you're evacuating the last cflinuxfs2 cells?
Do the apps stop running?
Does the deployment error?
If the apps stop running and the mitigation here is to STOP them in CCDB, would it be better to bump freshness if the only errors during sync are about unknown stacks?
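To make the last question concrete, the two freshness policies could be compared with a sketch like this one. It is purely illustrative and does not describe current behavior: `UnknownStackError`, `should_bump_freshness`, and the `tolerate_unknown_stacks` flag are hypothetical names invented for the comparison.

```python
from typing import Sequence


class UnknownStackError(Exception):
    """Hypothetical error type for a process whose stack is unsupported."""


def should_bump_freshness(
    errors: Sequence[Exception], tolerate_unknown_stacks: bool
) -> bool:
    """Decide whether a sync pass counts as clean enough to bump freshness."""
    if not errors:
        return True
    if tolerate_unknown_stacks:
        # Proposed policy: unknown-stack errors alone do not block freshness.
        return all(isinstance(e, UnknownStackError) for e in errors)
    # Policy described on this page: any error blocks the bump.
    return False


only_stack_errors = [UnknownStackError("no compiler defined for requested stack")]
mixed_errors = only_stack_errors + [RuntimeError("some other sync failure")]

current_policy = should_bump_freshness(only_stack_errors, tolerate_unknown_stacks=False)
proposed_policy = should_bump_freshness(only_stack_errors, tolerate_unknown_stacks=True)
proposed_mixed = should_bump_freshness(mixed_errors, tolerate_unknown_stacks=True)
```

Under the tolerant policy a pass with only unknown-stack errors would bump freshness. Any other error would still leave the domain unfresh.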
Collated context of how we came to have this behavior
October 2018 (#156029607): We made the clock log uncaught errors and exit 1.
November 2018 (#162064721): We made the clock log most errors and continue to sync, while refusing to bump freshness.
November 2018 (#161800100): We verified this behavior applies to apps with absent stacks.
May 2020 (Pivotal Slack): We started seeing a rash of this in escalations, with log lines where cc.diego.sync.processes logged sync-failed and error-updating-lrp-state.