Ansible Playbook Reference - thebaulab/onramp GitHub Wiki

This page documents the Ansible playbooks used to manage the baulab.us cluster.

What is Ansible and Why Do We Use It?

The Problem: Configuring 25+ Machines

Without automation, installing a package on all cluster machines means SSHing into each of 25+ machines and running the install command. Configuration changes, security updates, and package installations all require the same manual repetition.

The Solution: Ansible

Ansible is an automation tool: you describe the state you want, and it applies that state to many machines at once:

# This says: "I want htop installed on all machines"
- name: Install htop
  apt:
    name: htop
    state: present

When you run this playbook, Ansible:

  1. SSHs into all target machines (in parallel!)
  2. Checks if htop is already installed
  3. Installs it only if needed
  4. Reports what it did

Key concept: Idempotence - Running the same playbook twice should have the same result as running it once. If htop is already installed, Ansible does nothing on the second run.
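
Modules like apt are idempotent out of the box. Tasks that shell out are not: Ansible cannot infer their state, so they would run on every pass. The creates: argument is one common guard; a minimal sketch (the installer and marker paths below are made up for illustration):

```yaml
# Without a guard, this command would re-run on every play.
# "creates:" skips the task once the named path exists.
# (Paths are hypothetical, for illustration only.)
- name: Run one-time installer
  command: /opt/install-tool.sh
  args:
    creates: /opt/tool/bin/tool
```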

Why Not Just Write a Shell Script?

You could write:

for host in tokyo osaka karakuri ...; do
    ssh $host "sudo apt-get install -y htop"
done

But this has problems:

  • No parallelism (slow)
  • No error handling
  • No idempotence (runs apt install every time)
  • No structured way to organize tasks
  • Hard to maintain complex configurations

Ansible solves all of these.

How Our Ansible Setup Works

The Control Node

Ansible runs from a "control node" - a machine that SSHes into all the others. In our case, that's baunames.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    BAUNAMES (Ansible Control Node)                      β”‚
β”‚                                                                         β”‚
β”‚   /srv/ansible/          <- Playbooks for Ubuntu 20.04                  β”‚
β”‚   /srv/ansible24/        <- Playbooks for Ubuntu 24.04                  β”‚
β”‚   /etc/ansible/hosts     <- Inventory (list of machines)                β”‚
β”‚                                                                         β”‚
β”‚   You run: ansible-playbook packages.yml                                β”‚
β”‚                                                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
                                 β”‚ SSH (parallel connections)
                                 β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚                            β”‚                            β”‚
    β–Ό                            β–Ό                            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  tokyo  β”‚                β”‚  osaka  β”‚                β”‚karakuri β”‚
β”‚         β”‚                β”‚         β”‚                β”‚         β”‚
β”‚ htop    β”‚                β”‚ htop    β”‚                β”‚ htop    β”‚
β”‚installedβ”‚                β”‚installedβ”‚                β”‚installedβ”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Two Playbook Directories

We have two sets of playbooks:

Directory         Target                  Notes
/srv/ansible/     Ubuntu 20.04 machines   Current standard
/srv/ansible24/   Ubuntu 24.04 machines   For newer/migrated machines

Why two? Some package names or config file locations differ between Ubuntu versions. The 24.04 playbooks are updated for the newer OS.

The Inventory File

Ansible needs to know what machines exist. This is defined in /etc/ansible/hosts:

[ldapserver]
baunames
baunames24

[webservers]
bauserver
bauserver24

[gpus]
tokyo
karasuno
umibozu
hawaii
karakuri
kyoto
saitama
hokkaido
andromeda
kobe
osaka
# ... and more

[remote]
nagoya.research
hakone

The [brackets] define groups. Playbooks can target specific groups:

  • ansible-playbook cuda.yml -l gpus - Only GPU machines
  • ansible-playbook webserver.yml -l webservers - Only web servers

Running Ansible

Basic Commands

First, SSH to baunames as a local admin:

ssh baunames   # Uses your localdavidbau or localarnab account
cd /srv/ansible

Common commands:

# Run all playbooks on all hosts
ansible-playbook main.yml

# Run on a specific host only
ansible-playbook main.yml -l saitama

# Run a specific playbook
ansible-playbook cuda.yml -l gpus

# Dry run (see what would change without changing anything)
ansible-playbook packages.yml --check

# Verbose output (helpful for debugging)
ansible-playbook packages.yml -v

What -l (limit) Does

The -l flag limits which hosts to target:

# Only saitama
ansible-playbook main.yml -l saitama

# Only GPU machines
ansible-playbook main.yml -l gpus

# Multiple specific hosts
ansible-playbook main.yml -l "tokyo,osaka,karakuri"

Always use -l when testing - you don't want to accidentally break all machines at once!

Why Local Admin Accounts?

Notice we use localdavidbau or localarnab, not our LDAP accounts. Why?

The Ansible playbook ldapclient.yml configures LDAP authentication itself. If we used LDAP accounts to run it and something broke, we'd lock ourselves out! Local accounts work even when LDAP is down.

Playbook Reference

main.yml - The Master Playbook

main.yml runs all the other playbooks in order:

- import_playbook: ldapclient.yml
- import_playbook: localadmin.yml
- import_playbook: autofs.yml
- import_playbook: packages.yml
- import_playbook: eternal.yml
- import_playbook: webserver.yml
- import_playbook: cuda.yml
- import_playbook: nvidiadocker.yml
- import_playbook: miniconda.yml

This is typically what you run to fully configure a new machine.

Playbooks for All Hosts

Playbook         What It Does                               When to Run
ldapclient.yml   Configures SSSD for LDAP auth              New machine, or LDAP config changed
localadmin.yml   Creates localdavidbau/localarnab accounts  New machine, or adding admin
autofs.yml       Configures NFS automounting                New machine, or NFS config changed
packages.yml     Installs standard packages                 New machine, or adding packages
eternal.yml      Installs Eternal Terminal                  New machine

GPU-Only Playbooks

Playbook          What It Does                       When to Run
cuda.yml          Installs NVIDIA drivers + CUDA     New GPU machine, or driver update
nvidiadocker.yml  Installs Docker with GPU support   Needed for containerized ML
miniconda.yml     Installs Miniconda to /opt         New machine
driver.yml        Fixes common NVIDIA driver issues  When GPUs aren't working
gcloud.yml        Installs Google Cloud SDK          If needed for GCP
grubipv6.yml      Disables IPv6 via GRUB             Troubleshooting network issues
reboot.yml        Reboots the host                   After driver updates

Server-Only Playbooks

Playbook        What It Does                    Target
ldapserver.yml  Configures OpenLDAP server      baunames only
webserver.yml   Installs Apache, Let's Encrypt  bauserver only

What Each Playbook Does (Details)

ldapclient.yml

This is the most important playbook - it makes LDAP authentication work.

What it configures:

  1. Installs packages: sssd, sssd-ldap, ldap-utils, libnss-sss, libpam-sss
  2. Copies the LDAP CA certificate to /usr/local/share/ca-certificates/
  3. Deploys /etc/sssd/sssd.conf - tells SSSD how to reach baunames
  4. Deploys /etc/ldap/ldap.conf - LDAP client settings
  5. Deploys /etc/nsswitch.conf - tells the system to use SSSD for users/groups
  6. Enables pam_mkhomedir - auto-creates home directories on first login
  7. Restarts SSSD

When things go wrong: If users can't log in, this is often the culprit. Check /etc/sssd/sssd.conf and systemctl status sssd.
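
For orientation, sssd.conf has roughly this shape. Every value below is an illustrative assumption, not the cluster's actual configuration (which lives in /srv/ansible/ldap-client/):

```ini
# Illustrative sketch only -- real settings come from the playbook.
[sssd]
services = nss, pam
domains = baulab                      ; assumed domain name

[domain/baulab]
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://baunames            ; assumed server URI
ldap_search_base = dc=baulab,dc=us    ; assumed base DN
cache_credentials = true
```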

localadmin.yml

Creates emergency local admin accounts.

What it does:

  1. Creates localdavidbau and localarnab users
  2. Adds them to sudo group
  3. Installs their SSH public keys (from /srv/ansible/pubkeys/)
  4. Configures passwordless sudo

Why this matters: These accounts work even when LDAP is completely broken. Essential for recovery.
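
The four steps above could be sketched with standard modules like this (a hedged sketch, not the actual playbook; the pubkey filename is an assumption):

```yaml
- name: Create local admin account
  user:
    name: localdavidbau
    shell: /bin/bash
    groups: sudo          # step 2: sudo group membership
    append: yes

- name: Install SSH public key    # assumed filename under pubkeys/
  authorized_key:
    user: localdavidbau
    key: "{{ lookup('file', 'pubkeys/localdavidbau.pub') }}"

- name: Configure passwordless sudo
  copy:
    dest: /etc/sudoers.d/localadmin
    content: "localdavidbau ALL=(ALL) NOPASSWD: ALL\n"
    mode: "0440"
    validate: "visudo -cf %s"
```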

autofs.yml

Sets up automatic NFS mounting.

What it deploys:

  • /etc/auto.master - Tells autofs to manage /share
  • /etc/auto.share - Maps /share/* to NFS servers

Result: When you cd /share/u/yourname, autofs automatically mounts the NFS share.
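
The two maps fit together roughly like this (the server name and export path below are placeholders, not the real ones):

```text
# /etc/auto.master -- hand /share to autofs, using the map below
/share  /etc/auto.share

# /etc/auto.share -- each key becomes a directory under /share;
# "u" maps /share/u to an NFS export (placeholder server and path)
u  -rw,soft  nfsserver.example:/export/u
```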

packages.yml

Installs standard software everyone expects.

Packages installed:

  • Editors: vim, emacs
  • Terminals: tmux, screen, mosh, zsh
  • Network: openssh-server, curl, aria2
  • Monitoring: htop, glances
  • Development: git, python3-distutils
  • Documentation: texlive-full, pandoc
  • Utilities: jq, tree, ncdu, strace, tcpdump

Also removes: tracker and tracker-extract (file indexing that slows things down)
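
In playbook form, the install/remove split looks roughly like this (a sketch with a trimmed package list, not the full playbook):

```yaml
- name: Install standard packages
  apt:
    name:
      - vim
      - tmux
      - htop
      - git
    state: present
    update_cache: yes

- name: Remove file-indexing daemons that slow machines down
  apt:
    name:
      - tracker
      - tracker-extract
    state: absent
```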

cuda.yml

Installs NVIDIA drivers and CUDA toolkit.

What it does:

  1. Fixes any broken apt installations
  2. Removes conflicting old drivers
  3. Runs ubuntu-drivers autoinstall
  4. Installs nvidia-cuda-toolkit
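
Those four steps could be expressed roughly as follows (a sketch; the real playbook likely differs in detail):

```yaml
- name: Fix any broken apt state
  command: dpkg --configure -a

- name: Remove conflicting old drivers
  apt:
    name: "nvidia-*"    # the apt module accepts shell-style wildcards
    state: absent
    purge: yes

- name: Install the recommended NVIDIA driver
  command: ubuntu-drivers autoinstall

- name: Install the CUDA toolkit
  apt:
    name: nvidia-cuda-toolkit
    state: present
```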

Important: After running this, you usually need to reboot:

ansible-playbook reboot.yml -l hostname

miniconda.yml

Installs a system-wide Miniconda.

Details:

  • Version: 23.11.0 with Python 3.11
  • Location: /opt/miniconda3
  • Auto-update disabled

For users: After installation, initialize conda in your shell:

/opt/miniconda3/bin/conda init

Common Tasks

Adding a New Machine to the Cluster

  1. Get the machine on the network with a Khoury IP (contact IT)

  2. Create initial local admin (manually, since Ansible can't connect yet):

    sudo adduser localdavidbau
    sudo usermod -aG sudo localdavidbau
    
  3. Copy your SSH key from your laptop:

    ssh-copy-id -o ProxyJump=login.khoury.northeastern.edu localdavidbau@newmachine
    
  4. Add to inventory - Edit /etc/ansible/hosts on baunames:

    [gpus]
    ...
    newmachine
    
  5. Run Ansible:

    cd /srv/ansible
    ansible-playbook main.yml -l newmachine
    
  6. Reboot if GPU machine:

    ansible-playbook reboot.yml -l newmachine
    

Updating Packages on a Specific Machine

ssh baunames
cd /srv/ansible
ansible-playbook packages.yml -l hostname

Updating NVIDIA Drivers

ansible-playbook cuda.yml -l hostname
ansible-playbook reboot.yml -l hostname

After reboot, verify:

ssh hostname
nvidia-smi

Updating LDAP Client Config

If you need to change how machines connect to LDAP:

  1. Edit files in /srv/ansible/ldap-client/
  2. Run:
    ansible-playbook ldapclient.yml
    

Careful! Test on one machine first with -l hostname.

Rebooting Machines

# One machine
ansible-playbook reboot.yml -l saitama

# All GPU machines (⚠️ interrupts everyone!)
ansible-playbook reboot.yml -l gpus

Configuration Files

On baunames

Path                        Purpose
/srv/ansible/               All 20.04 playbooks
/srv/ansible24/             All 24.04 playbooks
/srv/ansible/ldap-client/   SSSD/LDAP client config templates
/srv/ansible/slapd-config/  LDAP server configs
/srv/ansible/pubkeys/       SSH public keys for admin accounts
/etc/ansible/hosts          Inventory of all machines

Files Deployed to Clients

File                 Source                     Purpose
/etc/sssd/sssd.conf  ldap-client/sssd.conf      SSSD configuration
/etc/ldap/ldap.conf  ldap-client/ldap.conf      LDAP client settings
/etc/nsswitch.conf   ldap-client/nsswitch.conf  Name service switch
/etc/auto.master     autofs templates           Autofs master map
/etc/auto.share      autofs templates           NFS share mappings

Troubleshooting

Playbook Fails

Read the error message! Ansible usually tells you exactly what went wrong.

Common issues:

  • Host unreachable: Machine is down or SSH isn't working
  • Permission denied: Your SSH key isn't on the target machine
  • Package not found: Ubuntu version mismatch (use the right playbook directory)

Testing a Single Task

Add --start-at-task to run from a specific point:

ansible-playbook packages.yml -l hostname --start-at-task="Install htop"

See What Would Change

Use --check for a dry run:

ansible-playbook packages.yml -l hostname --check

Verbose Output

Add -v, -vv, or -vvv for more detail:

ansible-playbook packages.yml -l hostname -vv

GitHub Backup

A backup of the ansible scripts is maintained at: https://github.com/thebaulab/khoury-ansible-scripts

This is useful if baunames is down and you need to set up a new control node.
