spank_oodproxy - BYUHPC/oodproxy GitHub Wiki

spank_oodproxy

Overview

BYU's oodproxy is a system designed to provide secure port forwarding for jobs running on a Slurm cluster. It enables users to access network ports open on a compute node, which are typically isolated from direct user access.

This SPANK plugin:

  1. Creates certificates for mutual TLS (mTLS) authentication between the oodproxy server and a program, such as stunnel, that is launched inside the job.
  2. Gathers a list of network ports opened inside the job. This ensures that users can only use the proxy to connect to their own processes.

This is only one piece of the puzzle. You need other components that will be documented soon.

Architecture

The system consists of the following components:

  1. SPANK Plugin (spank_oodproxy.c): Integrates with Slurm to handle certificate generation and management during the job lifecycle
  2. Certificate Generation Script (oodproxy_gencerts.sh): Creates the necessary TLS certificates for mTLS
  3. Port Registration Daemon (oodproxy_regd.sh): Discovers and registers open listening ports within the job
  4. Proxy Server (separate project to be documented later): Uses the generated certificates to establish secure connections to the job

Workflow

                                               ┌────────────────────┐
                                               │                    │
                                               │  External Client   │
                                               │                    │
                                               └──────────┬─────────┘
                                                          │
                                                          ▼
             ┌────────────────────────────────────────────────────────────────────────────┐
             │                                                                            │
             │                              Proxy Server                                  │
             │  (Uses generated certificates to establish mTLS connections to the job)    │
             │                                                                            │
             └────────────────────────────────────────────┬───────────────────────────────┘
                                                          │
                                                          │
┌─────────────────────────────────────────────────────────┼───────────────────────────────┐
│                             Compute Node                │                               │
│                                                         ▼                               │
│  ┌─────────────────┐      ┌────────────────┐      ┌────────────┐      ┌───────────────┐ │
│  │                 │      │                │      │            │      │               │ │
│  │  SPANK Plugin   ├──────►  Generate      ├──────►  stunnel   ├──────►  Process with │ │
│  │                 │      │  TLS Certs     │      │            │      │  Open Port    │ │
│  └────────┬────────┘      └────────────────┘      └────────────┘      └───────────────┘ │
│           │                                                                             │
│           │                                                                             │
│  ┌────────▼────────┐      ┌────────────────┐                                            │
│  │                 │      │                │                                            │
│  │  Registration   ├──────►  Allowed       │                                            │
│  │  Daemon         │      │  Destinations  │                                            │
│  │                 │      │                │                                            │
│  └─────────────────┘      └────────────────┘                                            │
│                                                                                         │
└─────────────────────────────────────────────────────────────────────────────────────────┘

Components

SPANK Plugin (spank_oodproxy.c)

The SPANK plugin integrates with Slurm and performs the following functions:

  1. Initialization:

    • Parses configuration parameters from plugstack.conf
    • Sets up necessary environment variables
    • Creates directory structure for TLS certificates
  2. Certificate Management:

    • Invokes the certificate generation script during job startup
    • Ensures certificates are cleaned up during job termination
  3. Port Registration:

    • Launches the registration daemon to discover and record listening ports
    • Creates and manages the allowed_destinations file, which lists accessible services
  4. Security:

    • Manages file permissions to ensure secure access to certificates

Certificate Generation Script (oodproxy_gencerts.sh)

This script creates the necessary TLS certificates for mTLS authentication:

  1. Generates a Certificate Authority (CA) key and certificate
  2. Creates server and client certificates signed by the CA
  3. Copies the certificates to locations accessible to the job and the proxy server
  4. Sets proper ownership and permissions on the certificate files

The script generates:

  • ca.key: Certificate Authority private key
  • ca.crt: Certificate Authority certificate
  • server.key and server.crt: Server-side key and certificate
  • client.key and client.crt: Client-side key and certificate
  • ca+client.crt: Combined CA and client certificates
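The steps above can be sketched with standard openssl commands. This is not the content of oodproxy_gencerts.sh; the function names `new_uuid` and `gen_mtls_certs`, the RSA key size, and the use of a random UUID as the CN are assumptions made for illustration, and the real script also handles copying and ownership, which are omitted here.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT the actual oodproxy_gencerts.sh.

# Hypothetical helper: print a random UUID to use as a certificate CN.
new_uuid() {
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen
    else
        cat /proc/sys/kernel/random/uuid   # Linux-only fallback
    fi
}

# Hypothetical helper: create CA, server, and client credentials in DIR.
# The body runs in a subshell so that umask and cd do not leak to the caller.
gen_mtls_certs() (
    dir="$1"
    days="${2:-365}"
    umask 077                              # private keys readable by owner only
    cd "$dir" || exit 1

    # Self-signed CA key and certificate
    openssl req -x509 -newkey rsa:2048 -nodes \
        -keyout ca.key -out ca.crt -days "$days" \
        -subj "/CN=$(new_uuid)" || exit 1

    for role in server client; do
        # Key and signing request, then a certificate signed by the CA
        openssl req -newkey rsa:2048 -nodes \
            -keyout "$role.key" -out "$role.csr" \
            -subj "/CN=$(new_uuid)" || exit 1
        openssl x509 -req -in "$role.csr" -CA ca.crt -CAkey ca.key \
            -CAcreateserial -out "$role.crt" -days "$days" || exit 1
        rm -f "$role.csr"
    done

    # Combined CA and client certificates
    cat ca.crt client.crt > ca+client.crt
)
```

For example, `gen_mtls_certs "$(mktemp -d)"` would leave the seven files listed above in a fresh temporary directory.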

Port Registration Daemon (oodproxy_regd.sh)

The registration daemon is responsible for:

  1. Waiting for the job to signal that it has started its service(s)
  2. Discovering listening TCP and UDP ports within the job's process namespace
  3. Writing a list of host:port combinations to the allowed_destinations file

It uses lsof to detect open ports and generates entries for each host and port combination, making them available to the proxy server.
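As a rough illustration of that discovery step, the sketch below turns lsof's machine-readable output (`-F n`, one field per line, name fields prefixed with `n`) into host:port lines. It is not the actual oodproxy_regd.sh: the function name `lsof_names_to_destinations`, the choice to replace wildcard listeners with a given hostname, and the decision to skip IPv6 entries are all assumptions.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; NOT the actual oodproxy_regd.sh.

# Read `lsof -F n` records on stdin and print unique host:port lines.
# $1 is the hostname to substitute for wildcard listeners such as "*:8888".
lsof_names_to_destinations() {
    local host="$1" line name addr port
    while IFS= read -r line; do
        case $line in
            n*'->'*|n*'['*) continue ;;   # skip established connections and IPv6
            n*:*) ;;                      # a name field that contains a port
            *) continue ;;                # pid, fd, and other fields
        esac
        name=${line#n}
        addr=${name%:*}
        port=${name##*:}
        case $port in
            ''|*[!0-9]*) continue ;;      # keep numeric ports only
        esac
        [ "$addr" = "*" ] && addr=$host
        printf '%s:%s\n' "$addr" "$port"
    done | LC_ALL=C sort -u
}

# Possible usage (TCP listeners owned by the current user):
#   lsof -nP -a -u "$USER" -iTCP -sTCP:LISTEN -F n \
#       | lsof_names_to_destinations "$(hostname)"
```

Whether loopback-only listeners should be registered as-is is a policy decision this sketch does not try to answer.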

Installation and Configuration

Prerequisites

  • Slurm with SPANK plugin support
  • Build tools including gcc and the Slurm development headers
  • openssl client commands
  • lsof

Compiling the SPANK Plugin

Run make, then copy the resulting .so to wherever Slurm should load it from. The location is up to you, as long as plugstack.conf points to it.

Configuration in plugstack.conf

Add the following to your Slurm plugstack.conf:

required /path/to/spank_oodproxy.so registration_daemon=/path/to/oodproxy_regd.sh oodproxy_root=/some/shared/fs/oodproxy/jobs gencerts=/path/to/oodproxy_gencerts.sh PATH=/usr/bin:/usr/sbin

Configuration parameters:

  • registration_daemon: Path to the registration daemon script
  • oodproxy_root: Root directory for storing certificates and job information
  • gencerts: Path to the certificate generation script
  • webserver_gid: Group of the web server (either a numeric GID or a group name)
  • PATH: Environment PATH to use when executing scripts
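A quick sanity check of those parameters can catch typos before jobs start failing. The sketch below is a hypothetical helper, not part of the project; it assumes the key=value syntax shown above and assumes that registration_daemon and gencerts must be executable files and that oodproxy_root must already exist as a directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper for checking a spank_oodproxy line from plugstack.conf.

# Print the value of KEY ($1) from a line of key=value arguments ($2).
plugstack_arg() {
    printf '%s\n' "$2" | tr -s ' \t' '\n' | sed -n "s|^$1=||p" | head -n 1
}

# Return non-zero if any of the configured paths looks wrong.
check_oodproxy_config() {
    local line="$1" rc=0 key path
    for key in registration_daemon gencerts; do
        path=$(plugstack_arg "$key" "$line")
        if [ ! -x "$path" ]; then
            echo "$key is missing or not executable: '$path'" >&2
            rc=1
        fi
    done
    path=$(plugstack_arg oodproxy_root "$line")
    if [ ! -d "$path" ]; then
        echo "oodproxy_root is missing or not a directory: '$path'" >&2
        rc=1
    fi
    return "$rc"
}

# Possible usage:
#   check_oodproxy_config "$(grep spank_oodproxy /etc/slurm/plugstack.conf)"
# (the plugstack.conf location is an assumption and varies by site)
```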

Directory Structure

The plugin creates the following directory structure:

/some/shared/fs/oodproxy/jobs/            # oodproxy_root
  └── <user_id>/                          # Per-user directory
      └── <job_id>/                       # Per-job directory
          └── allowed_destinations        # List of accessible host:port combinations

Additionally, it creates a temporary directory in /tmp/.oodproxy-XXXXXX/ to store the TLS certificates.

Usage

Job Submission

To enable OODProxy for a job, use the --oodproxy-register=1 option with sbatch:

sbatch --oodproxy-register=1 job_script.sh

In the Job Script

Jobs need to signal when they are ready for port registration by writing to the file descriptor specified in the OODPROXY_REG_READY_FD environment variable:

# Start your service
python -m http.server 8888 &

# Signal that the service is ready for registration
if [[ -n "$OODPROXY_REG_READY_FD" ]]; then
    echo >&$OODPROXY_REG_READY_FD
fi

Writing to that file descriptor prompts the registration daemon to survey the user's running processes in the job and find their open ports. The daemon writes the result to an allowed_destinations file, which the proxy later consults to decide whether a destination is allowed when a user tries to connect. This ensures that users can only reach ports that they themselves opened.
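The proxy-side lookup could be as simple as an exact-line match against that file. The sketch below is not the proxy's actual code (the proxy is a separate project); the function name `destination_allowed` and the one-entry-per-line file format are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical proxy-side check; the real proxy is a separate project.

# Succeed only if HOST:PORT appears verbatim as a full line in FILE.
destination_allowed() {
    local file="$1" host="$2" port="$3"
    [ -r "$file" ] || return 1
    grep -Fxq -- "${host}:${port}" "$file"
}

# Possible usage, following the directory structure described earlier:
#   destination_allowed "$oodproxy_root/$user_id/$job_id/allowed_destinations" node1 8888
```

Matching the whole line with fixed strings (`-F -x`) avoids treating `node1:8888` as a match for `node1:88880` or interpreting dots in hostnames as regex wildcards.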

TLS Certificate Access

The job can access the TLS certificates in the directory specified by the OODPROXY_DIR environment variable:

# Access certificate paths
CA_CERT="${OODPROXY_DIR}/ca.crt"
SERVER_CERT="${OODPROXY_DIR}/server.crt"
SERVER_KEY="${OODPROXY_DIR}/server.key"

Security Considerations

The system implements several security measures:

  1. mTLS Authentication: Both the client and server verify each other's identity
  2. Limited Access: Only registered ports are accessible through the proxy
  3. Unique Certificates: Each job gets its own unique set of certificates
  4. Cleanup on Exit: Certificates and registration information are removed when jobs end
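To check the mTLS handshake by hand, one option is openssl's s_client. This is only a debugging sketch: the function name `mtls_probe` is hypothetical, and it assumes the client certificate, client key, and CA certificate are all readable from one directory on the machine where it runs, which may not match how the proxy server stores them.

```shell
#!/usr/bin/env bash
# Hypothetical debugging helper; not part of oodproxy.

# Attempt an mTLS handshake with HOST:PORT using credentials from CERTDIR.
# Returns non-zero if the connection or certificate verification fails.
mtls_probe() {
    local host="$1" port="$2" certdir="$3"
    openssl s_client -connect "${host}:${port}" \
        -cert "${certdir}/client.crt" -key "${certdir}/client.key" \
        -CAfile "${certdir}/ca.crt" -verify_return_error \
        </dev/null >/dev/null 2>&1
}

# Possible usage:
#   mtls_probe node1 8443 /path/to/certs && echo "handshake ok"
```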

Certificate Lifecycle

  1. Certificates are generated when the job starts
  2. They are valid for 365 days, a period chosen to be at least as long as the longest job (configurable in oodproxy_gencerts.sh)
  3. They are destroyed when the job ends
  4. Each certificate has a unique UUID-based CN
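If you shorten the validity period, it may be worth confirming that a certificate still outlives the job's time limit. The sketch below uses openssl's `-checkend` option; the function name `cert_valid_for` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper; not part of oodproxy.

# Succeed if CERT will still be valid SECONDS from now.
cert_valid_for() {
    local cert="$1" seconds="$2"
    openssl x509 -checkend "$seconds" -noout -in "$cert" >/dev/null
}

# Possible usage inside a job, checking a 7-day window:
#   cert_valid_for "${OODPROXY_DIR}/server.crt" $((7 * 24 * 3600)) \
#       || echo "server.crt expires before the job could end" >&2
#
# To inspect the CN and expiry date directly:
#   openssl x509 -noout -subject -enddate -in "${OODPROXY_DIR}/server.crt"
```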

Integration with stunnel or Similar

Although this repository does not ship one, OODProxy is designed to work with a TLS termination program such as stunnel.

Example: stunnel Server-side Configuration (In the Job)

This is an example of how to use the certs in stunnel. Configurations vary wildly depending on whether sd_listen_fds is used, etc.

[server]
cert = ${OODPROXY_DIR}/server.crt
key = ${OODPROXY_DIR}/server.key
CAfile = ${OODPROXY_DIR}/ca+client.crt
requireCert = yes
verifyChain = yes
verifyPeer = yes
accept = 0.0.0.0:8443
connect = 127.0.0.1:8888
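stunnel may not expand shell-style variables such as ${OODPROXY_DIR} in its configuration file, so one option is to have the job script write the file with the paths already filled in. The sketch below does that with a here-document; the function name `write_stunnel_conf` and its default ports (taken from the example above) are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper for a job script; not part of oodproxy.

# Write a stunnel service section to OUT with OODPROXY_DIR expanded.
write_stunnel_conf() {
    local out="$1" accept_port="${2:-8443}" connect_port="${3:-8888}"
    if [ -z "${OODPROXY_DIR:-}" ]; then
        echo "OODPROXY_DIR is not set" >&2
        return 1
    fi
    cat > "$out" <<EOF
[server]
cert = ${OODPROXY_DIR}/server.crt
key = ${OODPROXY_DIR}/server.key
CAfile = ${OODPROXY_DIR}/ca+client.crt
requireCert = yes
verifyChain = yes
verifyPeer = yes
accept = 0.0.0.0:${accept_port}
connect = 127.0.0.1:${connect_port}
EOF
}

# Possible usage:
#   write_stunnel_conf "$TMPDIR/stunnel.conf" 8443 8888 && stunnel "$TMPDIR/stunnel.conf"
```

Depending on the environment, global stunnel options such as `pid` or `foreground` may also be needed before the `[server]` section.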

Troubleshooting

Potential Issues

  1. Certificate Generation Failures:

    • Check permissions on the oodproxy_root directory
    • Ensure the openssl command is available in the configured PATH
  2. Port Registration Issues:

    • Verify the job is correctly signaling readiness
  3. Directory Cleanup Failures:

    • NFS-related issues may prevent immediate directory removal
  4. GID Mismatch:

    • Make sure the webserver_gid matches that of the web server that the user's browser talks to
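One common cause of port registration issues is signaling readiness before the service has actually bound its port. A job script could wait for the port first, as in the sketch below. The function names `wait_for_port` and `signal_ready` are hypothetical, and the port probe relies on bash's `/dev/tcp` redirection, which is a bash feature rather than a real device.

```shell
#!/usr/bin/env bash
# Hypothetical job-script helpers; not part of oodproxy.

# Poll 127.0.0.1:PORT once per second, up to TRIES times (default 30).
wait_for_port() {
    local port="$1" tries="${2:-30}" i=0
    while [ "$i" -lt "$tries" ]; do
        # The subshell opens and immediately closes a TCP connection.
        if (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Write a newline to the registration fd, if the plugin provided one.
signal_ready() {
    [ -n "${OODPROXY_REG_READY_FD:-}" ] || return 0
    echo >&"${OODPROXY_REG_READY_FD}"
}

# Possible usage:
#   python -m http.server 8888 &
#   wait_for_port 8888 && signal_ready
```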

Limitations and Future Work

  1. IPv6 Support: Currently limited to IPv4 addresses
  2. Port Scanning: Relies on lsof for port discovery, which might miss some cases (no known cases yet)

Other Approaches

Rather than tie this implementation to Slurm, it would be very feasible to write a registration daemon that accepts connections over a Unix socket. SO_PEERCRED could be used by the daemon to check who is on the other side then perform the same registration function. Why not do it that way now? Because! I started off using SPANK and this seemed easy enough. If someone else prefers a different approach, go for it. I'm not set on the SPANK plugin approach.

Other Notes

We don't have a great name for this proxy solution. "oodproxy" seems decent enough, but we don't want to confuse anyone and make them think this is an official OOD project so, for now, we're referring to it as BYU's oodproxy.
