# Running `.after_script` Commands on a Remote Host: Challenges and Solutions

Running `.after_script` commands on a remote host presents several challenges due to the different execution contexts of the main job and the cleanup scripts. The primary issues revolve around maintaining state (like SSH connections), handling job failures, and accessing resources on the remote machine.
## State and Environment Challenges

- Different shell sessions: A significant challenge in GitLab CI is that the `after_script` section runs in a separate, new shell session from the main `script` section. This means that any environment variables, SSH agents, or other local state configured in `before_script` or `script` are not automatically available to the `after_script`.
- Dropped SSH connections: The most common way to run commands on a remote host is over SSH. However, if a job fails due to a system failure (e.g., the remote machine crashes or reboots), the SSH connection is terminated. The GitLab Runner may fail to execute the `after_script` if it cannot re-establish the connection, leaving resources uncleaned.
- Variable persistence: Any variables that are not explicitly exported or handled between the different script sections will be lost. This is particularly problematic for sensitive information or temporary values needed for cleanup.
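
The separate-session behavior is easy to demonstrate. The sketch below is illustrative only (the job name `state_demo` and the variable `REMOTE_WORKDIR` are made up): a variable exported in `script` is simply not set when `after_script` runs.

```yaml
state_demo:
  script:
    - export REMOTE_WORKDIR="/tmp/ci-scratch-$CI_JOB_ID"
    - echo "workdir is $REMOTE_WORKDIR"          # prints the real path
  after_script:
    # after_script starts a new shell session, so REMOTE_WORKDIR is empty here
    # unless it is written to a file in `script` and read back, or re-derived
    # from predefined variables such as $CI_JOB_ID.
    - echo "workdir is '$REMOTE_WORKDIR'"        # prints: workdir is ''
```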
## Failure Handling Challenges

- `after_script` not running on all failures: Historically, the `after_script` was not guaranteed to run if a job was cancelled or timed out, which could leave remote resources unmanaged. While recent versions of GitLab CI have improved this behavior to run `after_script` for cancelled jobs, other system failures can still prevent cleanup scripts from executing.
- Incorrect timeout handling: A known bug in GitLab can cause a job to report a "timeout" error instead of the actual error from the main script if the total runtime of the main script plus the `after_script` exceeds the main script's defined timeout.
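
One pragmatic mitigation is to set the job-level `timeout` with enough headroom that cleanup work in `after_script` does not push the job over the limit. A rough sketch, where the 45-minute value, `$REMOTE_HOST`, and the helper scripts are placeholders rather than values from this repository:

```yaml
remote_test:
  timeout: 45 minutes     # tests need ~30 minutes; the headroom covers cleanup time
  script:
    - ssh "user@$REMOTE_HOST" "./run_remote_tests.sh"
  after_script:
    - ssh "user@$REMOTE_HOST" "./cleanup_remote.sh" || true   # never mask the real failure
```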
## Security and Authentication Challenges

- SSH agent forwarding: While SSH agent forwarding can simplify authentication, it is not available by default in the `after_script` because of the separate shell session. To use it, you need a workaround to pass the agent's process ID (PID) from the main script to the cleanup script.
- Handling credentials: You must securely manage the credentials needed to connect to the remote host in both the main script and the `after_script`. Using CI/CD variables to store private keys is a common practice, but it requires careful handling.
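
As a hedged illustration of the CI/CD-variable approach, the snippet below assumes two variables you would define yourself: `SSH_PRIVATE_KEY` (the masked private key) and `SSH_KNOWN_HOSTS` (the remote host's public key line for `known_hosts`); `$REMOTE_HOST` and the job name are likewise placeholders.

```yaml
remote_job:
  before_script:
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    # Pin the remote host key instead of disabling host-key checking.
    - mkdir -p ~/.ssh && chmod 700 ~/.ssh
    - echo "$SSH_KNOWN_HOSTS" >> ~/.ssh/known_hosts
    - chmod 600 ~/.ssh/known_hosts
  script:
    - ssh "deploy@$REMOTE_HOST" "hostname"
```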
## Workarounds and Solutions
To address these challenges, implement strategies to ensure state is maintained and cleanup commands are executed reliably.
### 1. Persist State Between Script Sections
- Save the SSH agent PID: In your `before_script`, start the SSH agent and save its process ID to a file. In the `after_script`, read the PID back from the file so the cleanup steps can locate the agent and terminate it once the remote cleanup is done.

  ```yaml
  before_script:
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
    # Store the assignment (not just the bare number) so it can be eval'd later.
    # If the after_script also needs to open new SSH connections, save
    # SSH_AUTH_SOCK to the file in the same way.
    - echo "SSH_AGENT_PID=$SSH_AGENT_PID" > .ssh_agent_pid

  after_script:
    # New shell session: restore the agent PID from the file, then stop the agent.
    - eval "$(cat .ssh_agent_pid)"
    - kill "$SSH_AGENT_PID"
  ```
- Persist other variables: Use files to store any other variables or state that the `after_script` needs to know about. For example, if your job spins up a temporary cloud resource, save its ID to a file so that the `after_script` can read it back and tear the resource down, as sketched below.
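
A minimal sketch of that pattern; the job name, `instance_id` file, and helper scripts (`provision_test_host.sh`, `run_tests_on.sh`, `teardown_test_host.sh`) are hypothetical:

```yaml
provision_and_test:
  script:
    - INSTANCE_ID=$(./provision_test_host.sh)   # hypothetical provisioning helper
    - echo "$INSTANCE_ID" > instance_id         # persist the ID for after_script
    - ./run_tests_on.sh "$INSTANCE_ID"
  after_script:
    # New shell session: read the ID back from the file before tearing down.
    - INSTANCE_ID=$(cat instance_id 2>/dev/null || true)
    - if [ -n "$INSTANCE_ID" ]; then ./teardown_test_host.sh "$INSTANCE_ID"; fi
```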
### 2. Ensure `after_script` Runs on Failure

- Rerun SSH commands: If a job fails and terminates the SSH connection, you may need to re-authenticate inside the `after_script` before executing any cleanup commands. The `after_script` must be robust enough to handle a potentially broken connection by retrying or explicitly reconnecting.
- Implement a "finally" job: Create a separate job with `when: always` to handle cleanup tasks. This is a more explicit and robust way to manage teardown routines and may be a better option than relying solely on `after_script`; a sketch follows below.
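
A minimal sketch of such a "finally" job, assuming a two-stage pipeline; the host variable and script names are placeholders:

```yaml
stages:
  - test
  - cleanup

remote_test:
  stage: test
  script:
    - ssh "user@$REMOTE_HOST" "./run_remote_tests.sh"

cleanup_remote:
  stage: cleanup
  when: always            # runs even if remote_test fails or is cancelled
  script:
    - ssh "user@$REMOTE_HOST" "./cleanup_remote.sh" || true
```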
### 3. Manage Artifacts and Dependencies

- Transfer artifacts: If the `after_script` needs access to files generated by the main job, use GitLab's artifact system. This is especially useful for passing information between stages or for jobs that may run on different runners.
- Use `needs:artifacts`: For jobs within the same pipeline, use `needs:artifacts` to specify that a job depends on the artifacts of another job. This is more explicit and reliable than assuming the files will be present.
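
Building on the cleanup-job sketch above, the (hypothetical) `remote_test` job can publish a small state file as an artifact that the cleanup job downloads explicitly via `needs:artifacts`:

```yaml
remote_test:
  stage: test
  script:
    - ssh "user@$REMOTE_HOST" "./run_remote_tests.sh" | tee remote_state.txt
  artifacts:
    when: always          # keep the state file even when the job fails
    paths:
      - remote_state.txt

cleanup_remote:
  stage: cleanup
  when: always
  needs:
    - job: remote_test
      artifacts: true     # download remote_state.txt before cleaning up
  script:
    - ./cleanup_from_state.sh remote_state.txt   # hypothetical teardown helper
```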
### 4. Enhance Security and Robustness
- Use volumes for Docker runners: If you are using a Docker executor, mount a shared directory from the host into the job container. This allows the job to interact with a consistent file system, which can help with persistence and sharing data for cleanup scripts (see the runner configuration sketch after this list).
- Use a shell executor (with caution): Consider using a shell executor if you require jobs to run directly on the host machine where the runner is installed. While simpler for remote interactions, it offers less isolation and is generally less secure than using a Docker executor.
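
A rough sketch of the Docker-executor volume mount in the runner's `config.toml`; the runner name, image, and host paths are examples only, not settings from this repository:

```toml
[[runners]]
  name = "docker-runner"
  executor = "docker"
  [runners.docker]
    image = "ubuntu:22.04"
    # Mount a host directory into every job container so state files
    # (e.g. .ssh_agent_pid, instance_id) are also visible to cleanup scripts.
    volumes = ["/cache", "/srv/ci-shared:/shared"]
```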