Ansible Playbook Reference - thebaulab/onramp GitHub Wiki

This page documents the Ansible playbooks used to manage the baulab.us cluster.

What is Ansible and Why Do We Use It?

The Problem: Configuring 25+ Machines

Without automation, installing a package on all cluster machines means SSHing into each of 25+ machines and running the install command. Configuration changes, security updates, and package installations all require the same manual repetition.

The Solution: Ansible

Ansible is an automation tool: you describe the state you want, and it applies that state to many machines at once:

# This says: "I want htop installed on all machines"
- name: Install htop
  apt:
    name: htop
    state: present

When you run this playbook, Ansible:

  1. SSHs into all target machines (in parallel!)
  2. Checks if htop is already installed
  3. Installs it only if needed
  4. Reports what it did

Key concept: Idempotence - Running the same playbook twice should have the same result as running it once. If htop is already installed, Ansible does nothing on the second run.
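
Modules like apt are idempotent out of the box. Tasks that shell out are not: Ansible cannot infer their state, so they would run on every pass. The creates: argument is one common guard; a minimal sketch (the installer and marker paths below are made up for illustration):

```yaml
# Without a guard, this command would re-run on every play.
# "creates:" skips the task once the named path exists.
# (Paths are hypothetical, for illustration only.)
- name: Run one-time installer
  command: /opt/install-tool.sh
  args:
    creates: /opt/tool/bin/tool
```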

Why Not Just Write a Shell Script?

You could write:

for host in tokyo osaka karakuri ...; do
    ssh $host "sudo apt-get install -y htop"
done

But this has problems:

  • No parallelism (slow)
  • No error handling
  • No idempotence (runs apt install every time)
  • No structured way to organize tasks
  • Hard to maintain complex configurations

Ansible solves all of these.

How Our Ansible Setup Works

The Control Node

Ansible runs from a "control node" - a machine that SSHes into all the others. In our case, that's baunames.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    BAUNAMES (Ansible Control Node)                      β”‚
β”‚                                                                         β”‚
β”‚   /srv/ansible/          <- Playbooks for Ubuntu 20.04                  β”‚
β”‚   /srv/ansible24/        <- Playbooks for Ubuntu 24.04                  β”‚
β”‚   /etc/ansible/hosts     <- Inventory (list of machines)                β”‚
β”‚                                                                         β”‚
β”‚   You run: ansible-playbook packages.yml                                β”‚
β”‚                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β”‚ SSH (parallel connections)
                                 β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                            β”‚                            β”‚
    β–Ό                            β–Ό                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  tokyo  β”‚                β”‚  osaka  β”‚                β”‚karakuri β”‚
β”‚         β”‚                β”‚         β”‚                β”‚         β”‚
β”‚ htop    β”‚                β”‚ htop    β”‚                β”‚ htop    β”‚
β”‚installedβ”‚                β”‚installedβ”‚                β”‚installedβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two Playbook Directories

We have two sets of playbooks:

Directory         Target                  Notes
/srv/ansible/     Ubuntu 20.04 machines   Current standard
/srv/ansible24/   Ubuntu 24.04 machines   For newer/migrated machines

Why two? Some package names or config file locations differ between Ubuntu versions. The 24.04 playbooks are updated for the newer OS.

The Inventory File

Ansible needs to know what machines exist. This is defined in /etc/ansible/hosts:

[ldapserver]
baunames
baunames24

[webservers]
bauserver
bauserver24

[gpus]
tokyo
karasuno
umibozu
hawaii
karakuri
kyoto
saitama
hokkaido
andromeda
kobe
osaka
# ... and more

[remote]
nagoya.research
hakone

The [brackets] define groups. Playbooks can target specific groups:

  • ansible-playbook cuda.yml -l gpus - Only GPU machines
  • ansible-playbook webserver.yml -l webservers - Only web servers

Running Ansible

Basic Commands

First, SSH to baunames as a local admin:

ssh baunames   # Uses your localdavidbau or localarnab account
cd /srv/ansible

Common commands:

# Run all playbooks on all hosts
ansible-playbook main.yml

# Run on a specific host only
ansible-playbook main.yml -l saitama

# Run a specific playbook
ansible-playbook cuda.yml -l gpus

# Dry run (see what would change without changing anything)
ansible-playbook packages.yml --check

# Verbose output (helpful for debugging)
ansible-playbook packages.yml -v

What -l (limit) Does

The -l flag limits which hosts to target:

# Only saitama
ansible-playbook main.yml -l saitama

# Only GPU machines
ansible-playbook main.yml -l gpus

# Multiple specific hosts
ansible-playbook main.yml -l "tokyo,osaka,karakuri"

Always use -l when testing - you don't want to accidentally break all machines at once!

Why Local Admin Accounts?

Notice we use localdavidbau or localarnab, not our LDAP accounts. Why?

The Ansible playbook ldapclient.yml configures LDAP authentication itself. If we used LDAP accounts to run it and something broke, we'd lock ourselves out! Local accounts work even when LDAP is down.

Playbook Reference

main.yml - The Master Playbook

main.yml runs all the other playbooks in order:

- import_playbook: ldapclient.yml
- import_playbook: localadmin.yml
- import_playbook: autofs.yml
- import_playbook: packages.yml
- import_playbook: eternal.yml
- import_playbook: webserver.yml
- import_playbook: cuda.yml
- import_playbook: nvidiadocker.yml
- import_playbook: miniconda.yml

This is typically what you run to fully configure a new machine.

Playbooks for All Hosts

Playbook         What It Does                               When to Run
ldapclient.yml   Configures SSSD for LDAP auth              New machine, or LDAP config changed
localadmin.yml   Creates localdavidbau/localarnab accounts  New machine, or adding admin
autofs.yml       Configures NFS automounting                New machine, or NFS config changed
packages.yml     Installs standard packages                 New machine, or adding packages
eternal.yml      Installs Eternal Terminal                  New machine

GPU-Only Playbooks

Playbook          What It Does                       When to Run
cuda.yml          Installs NVIDIA drivers + CUDA     New GPU machine, or driver update
nvidiadocker.yml  Installs Docker with GPU support   Needed for containerized ML
miniconda.yml     Installs Miniconda to /opt         New machine
driver.yml        Fixes common NVIDIA driver issues  When GPUs aren't working
gcloud.yml        Installs Google Cloud SDK          If needed for GCP
grubipv6.yml      Disables IPv6 via GRUB             Troubleshooting network issues
reboot.yml        Reboots the host                   After driver updates

Server-Only Playbooks

Playbook        What It Does                    Target
ldapserver.yml  Configures OpenLDAP server      baunames only
webserver.yml   Installs Apache, Let's Encrypt  bauserver only

What Each Playbook Does (Details)

ldapclient.yml

This is the most important playbook - it makes LDAP authentication work.

What it configures:

  1. Installs packages: sssd, sssd-ldap, ldap-utils, libnss-sss, libpam-sss
  2. Copies the LDAP CA certificate to /usr/local/share/ca-certificates/
  3. Deploys /etc/sssd/sssd.conf - tells SSSD how to reach baunames
  4. Deploys /etc/ldap/ldap.conf - LDAP client settings
  5. Deploys /etc/nsswitch.conf - tells the system to use SSSD for users/groups
  6. Enables pam_mkhomedir - auto-creates home directories on first login
  7. Restarts SSSD

When things go wrong: If users can't log in, this is often the culprit. Check /etc/sssd/sssd.conf and systemctl status sssd.
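
For orientation, sssd.conf has roughly this shape. Every value below is an illustrative assumption, not the cluster's actual configuration (which lives in /srv/ansible/ldap-client/):

```ini
# Illustrative sketch only -- real settings come from the playbook.
[sssd]
services = nss, pam
domains = baulab                      ; assumed domain name

[domain/baulab]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://baunames            ; assumed server URI
ldap_search_base = dc=baulab,dc=us    ; assumed base DN
cache_credentials = true
```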

localadmin.yml

Creates emergency local admin accounts.

What it does:

  1. Creates localdavidbau and localarnab users
  2. Adds them to sudo group
  3. Installs their SSH public keys (from /srv/ansible/pubkeys/)
  4. Configures passwordless sudo

Why this matters: These accounts work even when LDAP is completely broken. Essential for recovery.
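
The four steps above could be sketched with standard modules like this (a hedged sketch, not the actual playbook; the pubkey filename is an assumption):

```yaml
- name: Create local admin account
  user:
    name: localdavidbau
    shell: /bin/bash
    groups: sudo          # step 2: sudo group membership
    append: yes

- name: Install SSH public key    # assumed filename under pubkeys/
  authorized_key:
    user: localdavidbau
    key: "{{ lookup('file', 'pubkeys/localdavidbau.pub') }}"

- name: Configure passwordless sudo
  copy:
    dest: /etc/sudoers.d/localadmin
    content: "localdavidbau ALL=(ALL) NOPASSWD: ALL\n"
    mode: "0440"
    validate: "visudo -cf %s"
```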

autofs.yml

Sets up automatic NFS mounting.

What it deploys:

  • /etc/auto.master - Tells autofs to manage /share
  • /etc/auto.share - Maps /share/* to NFS servers

Result: When you cd /share/u/yourname, autofs automatically mounts the NFS share.
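
The two maps fit together roughly like this (the server name and export path below are placeholders, not the real ones):

```text
# /etc/auto.master -- hand /share to autofs, using the map below
/share  /etc/auto.share

# /etc/auto.share -- each key becomes a directory under /share;
# "u" maps /share/u to an NFS export (placeholder server and path)
u  -rw,soft  nfsserver.example:/export/u
```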

packages.yml

Installs standard software everyone expects.

Packages installed:

  • Editors: vim, emacs
  • Terminals: tmux, screen, mosh, zsh
  • Network: openssh-server, curl, aria2
  • Monitoring: htop, glances
  • Development: git, python3-distutils
  • Documentation: texlive-full, pandoc
  • Utilities: jq, tree, ncdu, strace, tcpdump

Also removes: tracker and tracker-extract (file indexing that slows things down)
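
In playbook form, the install/remove split looks roughly like this (a sketch with a trimmed package list, not the full playbook):

```yaml
- name: Install standard packages
  apt:
    name:
      - vim
      - tmux
      - htop
      - git
    state: present
    update_cache: yes

- name: Remove file-indexing daemons that slow machines down
  apt:
    name:
      - tracker
      - tracker-extract
    state: absent
```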

cuda.yml

Installs NVIDIA drivers and CUDA toolkit.

What it does:

  1. Fixes any broken apt installations
  2. Removes conflicting old drivers
  3. Runs ubuntu-drivers autoinstall
  4. Installs nvidia-cuda-toolkit
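
Those four steps could be expressed roughly as follows (a sketch; the real playbook likely differs in detail):

```yaml
- name: Fix any broken apt state
  command: dpkg --configure -a

- name: Remove conflicting old drivers
  apt:
    name: "nvidia-*"    # the apt module accepts shell-style wildcards
    state: absent
    purge: yes

- name: Install the recommended NVIDIA driver
  command: ubuntu-drivers autoinstall

- name: Install the CUDA toolkit
  apt:
    name: nvidia-cuda-toolkit
    state: present
```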

Important: After running this, you usually need to reboot:

ansible-playbook reboot.yml -l hostname

miniconda.yml

Installs a system-wide Miniconda.

Details:

  • Version: 23.11.0 with Python 3.11
  • Location: /opt/miniconda3
  • Auto-update disabled

For users: After installation, initialize conda in your shell:

/opt/miniconda3/bin/conda init

Common Tasks

Adding a New Machine to the Cluster

  1. Get the machine on the network with a Khoury IP (contact IT)

  2. Create initial local admin (manually, since Ansible can't connect yet):

    sudo adduser localdavidbau
    sudo usermod -aG sudo localdavidbau
    
  3. Copy your SSH key from your laptop:

    ssh-copy-id -o ProxyJump=login.khoury.northeastern.edu localdavidbau@newmachine
    
  4. Add to inventory - Edit /etc/ansible/hosts on baunames:

    [gpus]
    ...
    newmachine
    
  5. Run Ansible:

    cd /srv/ansible
    ansible-playbook main.yml -l newmachine
    
  6. Reboot if GPU machine:

    ansible-playbook reboot.yml -l newmachine
    

Updating Packages on a Specific Machine

ssh baunames
cd /srv/ansible
ansible-playbook packages.yml -l hostname

Updating NVIDIA Drivers

ansible-playbook cuda.yml -l hostname
ansible-playbook reboot.yml -l hostname

After reboot, verify:

ssh hostname
nvidia-smi

Updating LDAP Client Config

If you need to change how machines connect to LDAP:

  1. Edit files in /srv/ansible/ldap-client/
  2. Run:
    ansible-playbook ldapclient.yml
    

Careful! Test on one machine first with -l hostname.

Rebooting Machines

# One machine
ansible-playbook reboot.yml -l saitama

# All GPU machines (⚠️ interrupts everyone!)
ansible-playbook reboot.yml -l gpus

Configuration Files

On baunames

Path                        Purpose
/srv/ansible/               All 20.04 playbooks
/srv/ansible24/             All 24.04 playbooks
/srv/ansible/ldap-client/   SSSD/LDAP client config templates
/srv/ansible/slapd-config/  LDAP server configs
/srv/ansible/pubkeys/       SSH public keys for admin accounts
/etc/ansible/hosts          Inventory of all machines

Files Deployed to Clients

File                 Source                     Purpose
/etc/sssd/sssd.conf  ldap-client/sssd.conf      SSSD configuration
/etc/ldap/ldap.conf  ldap-client/ldap.conf      LDAP client settings
/etc/nsswitch.conf   ldap-client/nsswitch.conf  Name service switch
/etc/auto.master     autofs templates           Autofs master map
/etc/auto.share      autofs templates           NFS share mappings

Troubleshooting

Playbook Fails

Read the error message! Ansible usually tells you exactly what went wrong.

Common issues:

  • Host unreachable: Machine is down or SSH isn't working
  • Permission denied: Your SSH key isn't on the target machine
  • Package not found: Ubuntu version mismatch (use the right playbook directory)

Testing a Single Task

Add --start-at-task to run from a specific point:

ansible-playbook packages.yml -l hostname --start-at-task="Install htop"

See What Would Change

Use --check for a dry run:

ansible-playbook packages.yml -l hostname --check

Verbose Output

Add -v, -vv, or -vvv for more detail:

ansible-playbook packages.yml -l hostname -vv

GitHub Backup

A backup of the ansible scripts is maintained at: https://github.com/thebaulab/khoury-ansible-scripts

This is useful if baunames is down and you need to set up a new control node.
