Running .after_script Commands on a Remote Host: Challenges and Solutions - TerrenceMcGuinness-NOAA/global-workflow GitHub Wiki

Running .after_script commands on a remote host presents several challenges due to the different execution contexts of the main job and the cleanup scripts. The primary issues revolve around maintaining state (like SSH connections), handling job failures, and accessing resources on the remote machine.

State and Environment Challenges

  • Different shell sessions: A significant challenge in GitLab CI is that the after_script section runs in a separate, new shell session from the main script section. This means that any environment variables, SSH agents, or other local state configured in before_script or script are not automatically available to the after_script.
  • Dropped SSH connections: The most common way to run commands on a remote host is over SSH. If a job fails due to a system failure (e.g., the remote machine crashes or reboots), the SSH connection is terminated, and the GitLab Runner may be unable to re-establish it in the after_script, leaving remote resources uncleaned.
  • Variable persistence: Any variables that are not explicitly exported or handled between the different script sections will be lost. This is particularly problematic for sensitive information or temporary values needed for cleanup.
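
As a concrete illustration of the separate-session behavior, the sketch below (variable and path names are placeholders) shows a value exported in script being lost in after_script, while a file-based handoff survives:

```yaml
script:
  - export CLEANUP_TARGET=/tmp/job-scratch     # placeholder path
  - echo "$CLEANUP_TARGET" > .cleanup_target   # persist it to a file
after_script:
  # after_script runs in a new shell, so the exported variable is gone here;
  # this typically prints "unset"
  - echo "CLEANUP_TARGET is ${CLEANUP_TARGET:-unset}"
  - rm -rf "$(cat .cleanup_target)"            # the file-based handoff still works
```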

Failure Handling Challenges

  • after_script not running on all failures: Historically, the after_script was not guaranteed to run if a job was cancelled or timed out, which could leave remote resources unmanaged. While recent versions of GitLab CI have improved this behavior to run after_script for canceled jobs, other system failures can still prevent cleanup scripts from executing.
  • Incorrect timeout handling: A known bug in GitLab can cause a job to report a "timeout" error instead of the actual error from the main script if the total runtime of the main script plus the after_script exceeds the main script's defined timeout.

Security and Authentication Challenges

  • SSH agent forwarding: While SSH agent forwarding can simplify authentication, it is not enabled by default for after_script due to the separate shell session. To use it, you need a workaround to pass the agent's process ID (PID) from the main script to the cleanup script.
  • Handling credentials: You must securely manage the credentials needed to connect to the remote host in both the main script and the after_script. Using CI/CD variables to store private keys is a common practice, but it requires careful handling.
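
A minimal sketch of that common pattern, assuming SSH_PRIVATE_KEY is a CI/CD variable holding the private key and remote-host.example.com stands in for the real host:

```yaml
before_script:
  - eval $(ssh-agent -s)
  # Strip carriage returns in case the key was pasted into the variable on Windows
  - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  # Pin the remote host key rather than disabling host-key checking
  - mkdir -p ~/.ssh && chmod 700 ~/.ssh
  - ssh-keyscan remote-host.example.com >> ~/.ssh/known_hosts
```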

Workarounds and Solutions

To address these challenges, implement strategies to ensure state is maintained and cleanup commands are executed reliably.

1. Persist State Between Script Sections

  • Save the SSH agent PID: In your before_script, start the SSH agent and save its process ID to a file. Because the after_script starts a fresh shell, read the PID back from that file there so the cleanup commands can address the same agent and shut it down once cleanup is complete.

    before_script:
      - eval $(ssh-agent -s)
      - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
      # Record the assignment, not just the bare PID, so eval can restore it later
      - echo "SSH_AGENT_PID=$SSH_AGENT_PID" > .ssh_agent_pid
    after_script:
      # Restore SSH_AGENT_PID in the new shell, then shut the agent down
      - eval $(cat .ssh_agent_pid)
      - kill "$SSH_AGENT_PID"
    
    
  • Persist other variables: Use files to store any other variables or state that the after_script needs to know. For example, if your job spins up a temporary cloud resource, save its ID to a file so that the after_script can reference and tear it down.
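
For instance, a sketch of the file-based handoff, where create-temp-resource and delete-temp-resource are hypothetical provisioning and teardown commands:

```yaml
script:
  - RESOURCE_ID=$(create-temp-resource)        # hypothetical provisioning command
  - echo "$RESOURCE_ID" > .resource_id
after_script:
  # The file survives the shell boundary; the exported variable would not
  - delete-temp-resource "$(cat .resource_id)" # hypothetical teardown command
```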

2. Ensure after_script Runs on Failure

  • Rerun SSH commands: If a job fails and terminates the SSH connection, you may need to re-authenticate inside the after_script to execute any cleanup commands. The after_script must be robust enough to handle a potentially broken connection by retrying or explicitly reconnecting.
  • Implement a "finally" job: Create a separate job with when: always to handle cleanup tasks. This is a more explicit and robust way to manage teardown routines and may be a better option than relying solely on after_script.
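
One way to make cleanup commands tolerant of a dropped connection is a small retry wrapper; this is a sketch, with the command, user, host, and remote path below as placeholders:

```shell
# Retry a cleanup command a few times with a short pause between attempts,
# so a transient SSH failure does not abort the whole after_script.
retry_cleanup() {
  max_tries=3
  n=1
  until "$@"; do
    if [ "$n" -ge "$max_tries" ]; then
      echo "cleanup failed after $max_tries attempts" >&2
      return 1
    fi
    n=$((n + 1))
    sleep 1
  done
}

# Example use in an after_script step (user@remote-host is a placeholder):
# retry_cleanup ssh -o BatchMode=yes user@remote-host 'rm -rf /tmp/ci-scratch'
```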

3. Manage Artifacts and Dependencies

  • Transfer artifacts: If the after_script needs access to files generated by the main job, use GitLab's artifact system. This is especially useful for passing information between stages or for jobs that may run on different runners.
  • Use needs:artifacts: For jobs within the same pipeline, use needs:artifacts to specify that a job depends on the artifacts of another. This is more explicit and reliable than assuming the files will be present.
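
Putting the two points together, a sketch of a "finally"-style cleanup job that consumes the main job's artifacts (stage names and the provision/teardown scripts are illustrative placeholders):

```yaml
stages: [deploy, cleanup]

deploy:
  stage: deploy
  script:
    - ./provision.sh > resource_info.txt   # hypothetical provisioning script
  artifacts:
    paths:
      - resource_info.txt
    when: always          # publish the file even if the job fails

cleanup:
  stage: cleanup
  when: always            # run even if deploy failed
  needs:
    - job: deploy
      artifacts: true     # fetch deploy's artifacts before running
  script:
    - ./teardown.sh "$(cat resource_info.txt)"   # hypothetical teardown script
```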

4. Enhance Security and Robustness