Problems around updating Salt Minion (bundled) with Salt

Problem description

Updating Salt with Salt is problematic because the update replaces Python code, and salt-minion needs to be restarted to pick up the updated code. One contributing factor is the Salt loader, which caches Python function objects. After an update changes those functions, the cache can be outdated and produce stack traces. A related problem is that parts of the Salt code base are already loaded into memory while other parts might not be in memory yet. The on-disk code gets updated while the in-memory code does not. When the update changes an internal API (e.g. function definitions), the in-memory code might make incorrect calls, again leading to stack traces.

With traditional Salt packages there is an additional cause for Salt loader problems: they can also occur when dependencies are updated, including Python itself. Vendoring all used dependencies in the bundle solves this specific problem, but keeps the general problem of updating salt-minion without restarting it.

Now that we have established that a salt-minion restart is needed, we run into the next problem: salt-minion can lose Salt jobs when it is restarted. salt-minion does not implement persistent Salt jobs; in other words, it loses information when it is restarted. The consequence is that the restart must be timed so that it happens after the current job is done.

If we add these problems together, we see that we need to restart salt-minion as soon as possible after updating the code, but only after everything in the Salt job is done. One way to achieve this is to always do the salt-minion update last. That way it is possible to systemctl restart salt-minion right after the update without losing other states, as those are already finished.

Solution: Always update venv-salt-minion last?

This solution is already implemented in different parts of Uyuni, but it has a big limitation: the implementation is done in Salt states. Users might bring their own states that trigger the described problem, and those states are out of our hands.

Package Installation

We take care of installing our Salt minion package with Uyuni-provided states that contain order: last. A fresh Salt minion installation only happens during bootstrap over Salt SSH and does not suffer from the described problem.
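As a minimal sketch of the pattern (the state ID below is hypothetical; the actual Uyuni-provided states differ in detail):

```yaml
# Sketch only: install the bundle after every other state in this run.
mgr_install_salt_minion:
  pkg.installed:
    - name: venv-salt-minion
    - order: last
```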

Patch Installation

Updates to salt-minion happen in a different state (id: mgr_update_stack_patches) than regular updates, but in the same state file (patchinstall.sls). Although this installation is done before the other installations, it is normally not a source of problems: rpm triggers a systemctl restart salt-minion in the package's %post scriptlet, causing systemd to send SIGTERM to salt-minion. This signal is caught and the restart is postponed until the state execution is done. (I am not sure when exactly the restart happens, i.e. whether it is after the completion of this sls file or after executing other states as well.)
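To illustrate the ordering, a simplified sketch along these lines (not the actual contents of patchinstall.sls; the advisory IDs and the second state ID are invented):

```yaml
# The update-stack patches (containing salt-minion) are applied first ...
mgr_update_stack_patches:
  pkg.patch_installed:
    - advisory_ids:
      - SUSE-2024-0001   # hypothetical advisory ID

# ... and the remaining patches only afterwards.
mgr_other_patches:
  pkg.patch_installed:
    - advisory_ids:
      - SUSE-2024-0002   # hypothetical advisory ID
    - require:
      - pkg: mgr_update_stack_patches
```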

Product Migration

Product/SP Migration calls zypper dup under the hood, so a salt-minion package up-/downgrade is not done separately. In practice the spmigration state works without a problem, but there was a race condition where Uyuni sent a follow-up job too quickly while salt-minion was still restarting. This was fixed in uyuni-project/uyuni#3937.

Keep latest package + Highstate

We don't currently do anything about this use case. A user might visit "System -> States -> Packages" and set salt-minion to "Installed/Latest". Doing this includes the package update in the system's highstate whenever a new version becomes available.

Implementation-wise, the "Installed/Latest" setting is translated into a pkg.latest state without an explicit ordering. Multiple packages configured as "Installed/Latest" are added to the pkgs list of the same pkg.latest state, but that is not the problem. The problem is that this state executes at an unknown time, likely somewhere in the middle of all the states that are part of the highstate.
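Roughly, the generated fragment looks like this (the state ID and the second package are invented for illustration):

```yaml
# No explicit ordering: this runs at an arbitrary point of the highstate.
mgr_pkg_latest:
  pkg.latest:
    - pkgs:
      - venv-salt-minion
      - some-other-package   # hypothetical second "Installed/Latest" package
```

Adding order: last to the generated state whenever venv-salt-minion is in the list would be one conceivable fix, analogous to the installation states above.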

Solution: Atomically update venv-salt-minion?

This solution is implemented inside the bundle itself, so it even works with a state that installs the venv-salt-minion update first.

How could this work? The venv-salt-minion update does not replace any file; it only adds a new version of the bundle in its own location. The old salt-minion process sticks around and keeps using its own set of files, even after the update. Then, once salt-minion is idle, the restart happens and the newly installed bundle is used.
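To illustrate the idea, a hypothetical on-disk layout after such an update (paths and version numbers are invented):

```
/usr/lib/venv-salt-minion/3006.0/   # old bundle, still used by the running process
/usr/lib/venv-salt-minion/3007.1/   # newly added bundle, used after the restart
```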

Multiple versions must be co-installable

This solution requires the ability to install multiple versions of venv-salt-minion at the same time. Only SUSE-family distributions can do that; RHEL-family and Debian-family distributions don't support anything like Provides: multiversion(venv-salt-bundle).
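On SUSE-family distributions the pieces involved would look roughly like this (the Provides name is taken from the paragraph above; the zypp.conf mechanism is the same one used to keep multiple kernel versions installed in parallel):

```
# venv-salt-minion.spec (sketch): mark the package as multiversion-capable
Provides: multiversion(venv-salt-bundle)

# /etc/zypp/zypp.conf: tell libzypp to keep multiple versions installed
multiversion = provides:multiversion(venv-salt-bundle)
```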

A potential workaround that I haven't researched further is to not use native packages and instead come up with another deployment strategy: something like creating a tarball on the Uyuni server, copying it to the client, and extracting it.

Insight into salt-minion state

Triggering the restart at the correct moment requires knowledge about the current workload. The component best positioned to know what is going on is salt-minion itself. There might be a way to monitor whether all received jobs are done; I haven't looked at this closely.

Jobs are executed by worker threads/subprocesses which send the results back, so the main salt-minion process might be oblivious to the job status. But since there are utility functions that e.g. cause salt-minion to terminate all jobs (saltutil.kill_all_jobs), some introspection capabilities to hook into are probably available.
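For example, listing the jobs a minion is currently executing is already possible via the saltutil execution module:

```
# Run locally on the minion: show the jobs it is currently executing.
salt-call saltutil.running
```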

Trigger service restart

salt-minion already communicates with systemd via the sd_notify protocol. WATCHDOG=trigger can be sent from the process to systemd to trigger the systemd watchdog. Our systemd service definition includes Restart=on-failure, which also covers watchdog timeouts. Sending the mentioned WATCHDOG=trigger is treated like a watchdog timeout, so systemd restarts the service.
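A sketch of the relevant unit settings (only Restart=on-failure is confirmed above; Type=notify follows from the sd_notify usage, the rest is an assumption):

```ini
# venv-salt-minion.service (sketch)
[Service]
Type=notify
# Restart=on-failure also restarts the unit on watchdog events,
# which is exactly what a WATCHDOG=trigger notification produces.
Restart=on-failure
```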

Conclusion

This solution depends first and foremost on the availability of co-installable venv-salt-minion packages. Without being able to install two versions at once, we can't keep the old version when we update to the new one.

Only SUSE-family distributions support this package management feature, which is not enough for Uyuni. We would need to bypass the native package management of our clients to co-install different versions of the Salt bundle.

Solution: Persistent States?

The idea is to enable salt-minion to save its current state, restart, and resume. This would be a change to the very core of Salt, which is not easy from either a technological or a political point of view. Such changes must be discussed with upstream, and these discussions take a lot of time.

I haven't spent much time on this idea in the scope of researching solutions, but it might be the best technical solution.

Uyuni has implemented Action Chains using the same idea, but with different mechanisms. We could revisit and probably simplify Action Chains if persistent states work.