Stack upgrades & ProcessSync: What happens if apps use a stack the platform no longer supports? - cloudfoundry/cloud_controller_ng GitHub Wiki

If I upgrade CC to a version that drops support for a stack, but some of my running apps still use that stack...

Diego will be very careful not to cause unexpected app downtime.
Those newly invalid, old-stack apps will continue to run and be routable, but CC will no longer be able to send updates for them to Diego.
The system will recognize this and refuse to delete any compute resources until it can confirm that they aren't the old-stack apps that it can no longer sync.

Apps on the unsupported stack continue to exist in CCDB
They continue to exist in the BBS as Diego DesiredLRPs
They continue to run on Diego Cells as Diego ActualLRPs (?)
They continue to be routable (?)
They can no longer be updated or created in Diego
- Updates and creates will fail with the error "no compiler defined for requested stack"
- Any change to the process' updated_at will make Diego's DesiredLRP out-of-date
- The ProcessSync loop will attempt to update all out-of-date DesiredLRPs
Because the domain is unfresh:
- They can be deleted in the CF API, but Diego will not stop running their ActualLRPs (?)
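
The out-of-date check described in this list can be modeled roughly as follows. This is a minimal Python sketch, not Cloud Controller's actual code: the `Process` and `DesiredLRP` classes, the `annotation` field assumed to hold the last-synced `updated_at`, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Process:
    guid: str
    stack: str
    updated_at: float  # CC-side timestamp; changes on any process update


@dataclass
class DesiredLRP:
    process_guid: str
    annotation: str  # assumption: records the updated_at value last synced


def is_out_of_date(process: Process, lrp: DesiredLRP) -> bool:
    """True when CC's copy has changed since the DesiredLRP was last written."""
    return str(process.updated_at) != lrp.annotation


def out_of_date_processes(processes, lrps_by_guid):
    """Processes that a sync pass would try to push to Diego again."""
    return [
        p for p in processes
        if p.guid in lrps_by_guid and is_out_of_date(p, lrps_by_guid[p.guid])
    ]
```

Under this model, any edit to an old-stack process changes `updated_at`, so the process is selected for an update on every sync pass and fails each time.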

The ProcessSync loop continues to run
In parallel, it continues to sync as many CC processes as possible to Diego as DesiredLRPs
Any app with an unsupported stack will error on update if Diego's DesiredLRP is out-of-date.
Update errors will prevent freshness from being bumped
All errors encountered should be logged by the clock
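
The sync behavior described above (keep going after an error, log it, and withhold the freshness bump) can be sketched like this. It is an illustrative Python model under stated assumptions, not the real ProcessSync implementation: `sync_processes`, `UnsupportedStackError`, and the `update_lrp` / `bump_freshness` callables are all hypothetical.

```python
import logging

logger = logging.getLogger("sync-sketch")


class UnsupportedStackError(Exception):
    """Stands in for the 'no compiler defined for requested stack' failure."""


def sync_processes(out_of_date, update_lrp, bump_freshness):
    """Try every out-of-date process; bump freshness only after a clean pass.

    out_of_date    -- iterable of process identifiers needing an update
    update_lrp     -- callable that pushes one process to Diego; may raise
    bump_freshness -- callable that marks the domain fresh
    Returns the list of (process, error) pairs that failed.
    """
    failures = []
    for process in out_of_date:
        try:
            update_lrp(process)
        except Exception as err:  # keep syncing the remaining processes
            logger.error("error updating %s: %s", process, err)
            failures.append((process, err))

    # A single failure is enough to leave the domain unfresh.
    if not failures:
        bump_freshness()
    return failures
```

In this model one old-stack process with a stale DesiredLRP is enough to keep the domain unfresh indefinitely, even though every other process syncs successfully.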

Apps on still-supported stacks can be created, updated, scaled, etc.
Because the domain is unfresh:
- They can be deleted in the CF API
- BUT Diego will not stop running their ActualLRPs
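
The effect of an unfresh domain on deletes can be pictured with a short Python sketch. It models the behavior as this page describes it and is not Diego's convergence code; `lrps_to_stop` and its arguments are hypothetical.

```python
def lrps_to_stop(actual_guids, desired_guids, domain_is_fresh):
    """Return the running LRPs that are safe to stop.

    An ActualLRP with no matching DesiredLRP looks like an orphan, but while
    the domain is unfresh the desired set may simply be incomplete, so
    nothing is stopped.
    """
    if not domain_is_fresh:
        return set()
    return set(actual_guids) - set(desired_guids)
```

With `domain_is_fresh=False`, an app that was deleted through the CF API, and is therefore no longer desired, is still left running.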

Is this the best we can do to handle this class of failure?
Should we tolerate unknown stack errors for bumping freshness?
What does Diego do if you're evacuating the last cflinuxfs2 cells?
- Do the apps stop running?
- Does the deployment error?
If the apps stop running and the mitigation here is to STOP them in CCDB, would it be better to bump freshness if the only errors during sync are about unknown stacks?
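
To make the last question concrete, here is a minimal Python sketch contrasting the current policy (any error blocks the freshness bump) with the proposed one (errors about unknown stacks are tolerated). The function name, the flag, and the idea of matching on the error message text are assumptions made for illustration only.

```python
UNKNOWN_STACK_MESSAGE = "no compiler defined for requested stack"


def should_bump_freshness(error_messages, tolerate_unknown_stack=False):
    """Decide whether a sync pass may mark the domain fresh.

    Default: any error blocks the bump (the behavior described on this page).
    Variant: errors that are only about unknown stacks are ignored.
    """
    if tolerate_unknown_stack:
        error_messages = [
            m for m in error_messages if UNKNOWN_STACK_MESSAGE not in m
        ]
    return len(error_messages) == 0
```

Whether the variant is safe depends on the answers to the questions above; the sketch only shows where the two policies differ.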

October 2018: #156029607 We made the clock log uncaught errors and exit 1.
November 2018: #162064721 We made the sync log most errors and continue syncing, while refusing to bump freshness.
November 2018: #161800100 We verified this behavior applies to apps with absent stacks.
December 2018: A KB Article was written about recovering from this issue
May 2020: (Pivotal Slack) We started seeing a rash of these failures in escalations, with log lines in which cc.diego.sync.processes logged sync-failed and error-updating-lrp-state.