Ansible Playbook Reference - thebaulab/onramp GitHub Wiki
Ansible Playbook Reference
This page documents the Ansible playbooks used to manage the baulab.us cluster.
What is Ansible and Why Do We Use It?
The Problem: Configuring 25+ Machines
Without automation, installing a package on all cluster machines means SSHing into each of 25+ machines and running the install command. Configuration changes, security updates, and package installations all require the same manual repetition.
The Solution: Ansible
Ansible is an automation tool that lets you describe what you want and then applies it to many machines at once:
# This says: "I want htop installed on all machines"
- name: Install htop
  apt:
    name: htop
    state: present
When you run a playbook containing this task, Ansible:
- SSHs into all target machines (in parallel!)
- Checks if htop is already installed
- Installs it only if needed
- Reports what it did
Key concept: Idempotence - Running the same playbook twice should have the same result as running it once. If htop is already installed, Ansible does nothing on the second run.
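The check-then-change pattern behind idempotence can be illustrated in plain shell. This is only a sketch of the idea, not how Ansible's apt module is implemented; the ensure_line helper and the file it edits are hypothetical.

```shell
# Hypothetical helper illustrating idempotence: make sure FILE contains LINE.
# It inspects the current state first and only modifies the file when needed,
# then reports what it did, loosely mirroring Ansible's "ok" / "changed".
ensure_line() {
    line=$1
    file=$2
    if [ -f "$file" ] && grep -qxF -e "$line" "$file"; then
        echo "ok"
    else
        printf '%s\n' "$line" >> "$file"
        echo "changed"
    fi
}

# Example: the first call reports "changed", every later call reports "ok".
#   ensure_line "htop" /tmp/wanted-packages.txt
```

Running the helper any number of times leaves the file in the same state as running it once, which is the property Ansible modules aim for.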
Why Not Just Write a Shell Script?
You could write:
for host in tokyo osaka karakuri ...; do
    ssh $host "apt install htop"
done
But this has problems:
- No parallelism (slow)
- No error handling
- No idempotence (runs apt install every time)
- No structured way to organize tasks
- Hard to maintain complex configurations
Ansible solves all of these.
How Our Ansible Setup Works
The Control Node
Ansible runs from a "control node" - a machine that SSHes into all the others. In our case, that's baunames.
┌──────────────────────────────────────────────────────────────────┐
│                 BAUNAMES (Ansible Control Node)                  │
│                                                                  │
│   /srv/ansible/       <- Playbooks for Ubuntu 20.04              │
│   /srv/ansible24/     <- Playbooks for Ubuntu 24.04              │
│   /etc/ansible/hosts  <- Inventory (list of machines)            │
│                                                                  │
│   You run: ansible-playbook packages.yml                         │
│                                                                  │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
                                 │ SSH (parallel connections)
                                 │
            ┌────────────────────┼────────────────────┐
            │                    │                    │
            ▼                    ▼                    ▼
       ┌─────────┐          ┌─────────┐          ┌─────────┐
       │  tokyo  │          │  osaka  │          │karakuri │
       │         │          │         │          │         │
       │  htop   │          │  htop   │          │  htop   │
       │installed│          │installed│          │installed│
       └─────────┘          └─────────┘          └─────────┘
Two Playbook Directories
We have two sets of playbooks:
| Directory | Target | Notes |
|---|---|---|
| /srv/ansible/ | Ubuntu 20.04 machines | Current standard |
| /srv/ansible24/ | Ubuntu 24.04 machines | For newer/migrated machines |
Why two? Some package names or config file locations differ between Ubuntu versions. The 24.04 playbooks are updated for the newer OS.
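Choosing the directory from a machine's Ubuntu release could be sketched as below. The playbook_dir_for helper is hypothetical and only encodes the two releases listed in the table; the example also assumes lsb_release is available on the target.

```shell
# Hypothetical helper: map an Ubuntu release to the playbook directory
# listed in the table above. Prints the directory, or fails for a
# release that has no playbook set.
playbook_dir_for() {
    case "$1" in
        20.04*) echo "/srv/ansible" ;;
        24.04*) echo "/srv/ansible24" ;;
        *) echo "no playbook directory for Ubuntu $1" >&2; return 1 ;;
    esac
}

# Example (run from baunames; assumes lsb_release exists on the target):
#   cd "$(playbook_dir_for "$(ssh tokyo lsb_release -rs)")"
```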
The Inventory File
Ansible needs to know what machines exist. This is defined in /etc/ansible/hosts:
[ldapserver]
baunames
baunames24
[webservers]
bauserver
bauserver24
[gpus]
tokyo
karasuno
umibozu
hawaii
karakuri
kyoto
saitama
hokkaido
andromeda
kobe
osaka
# ... and more
[remote]
nagoya.research
hakone
The [brackets] define groups. Playbooks can target specific groups:
- ansible-playbook cuda.yml -l gpus - Only GPU machines
- ansible-playbook webserver.yml -l webservers - Only web servers
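To make the group structure concrete, here is a minimal sketch that prints the hosts under one [group] header. The group_hosts helper is hypothetical and handles only the simple layout shown above; it is not a replacement for Ansible's own inventory parsing.

```shell
# Hypothetical helper: print the hosts listed under [GROUP] in an
# INI-style inventory. Handles only the simple layout shown above:
# no ":children" groups, host ranges, or per-host variables.
group_hosts() {
    group=$1
    inventory=${2:-/etc/ansible/hosts}
    awk -v want="[$group]" '
        /^[ \t]*(#|$)/ { next }                      # skip comments and blank lines
        /^\[/          { in_group = ($0 == want); next }
        in_group       { print $1 }
    ' "$inventory"
}

# Example: group_hosts gpus
```

On the control node itself, `ansible gpus --list-hosts` or `ansible-inventory --graph` gives the authoritative answer, since those use Ansible's real parser.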
Running Ansible
Basic Commands
First, SSH to baunames as a local admin:
ssh baunames # Uses your localdavidbau or localarnab account
cd /srv/ansible
Common commands:
# Run all playbooks on all hosts
ansible-playbook main.yml
# Run on a specific host only
ansible-playbook main.yml -l saitama
# Run a specific playbook
ansible-playbook cuda.yml -l gpus
# Dry run (see what would change without changing anything)
ansible-playbook packages.yml --check
# Verbose output (helpful for debugging)
ansible-playbook packages.yml -v
What -l (limit) Does
The -l flag limits which hosts to target:
# Only saitama
ansible-playbook main.yml -l saitama
# Only GPU machines
ansible-playbook main.yml -l gpus
# Multiple specific hosts
ansible-playbook main.yml -l "tokyo,osaka,karakuri"
Always use -l when testing - you don't want to accidentally break all machines at once!
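One way to make the limit hard to forget is a small wrapper. The run_limited function below is a hypothetical sketch, not something that exists in /srv/ansible; it relies only on the standard ansible-playbook options -l, --list-hosts, and --check.

```shell
# Hypothetical wrapper: refuse to run a playbook without an explicit limit,
# then preview the matched hosts and do a dry run before the real run.
run_limited() {
    playbook=${1:-}
    limit=${2:-}
    if [ -z "$playbook" ] || [ -z "$limit" ]; then
        echo "usage: run_limited PLAYBOOK LIMIT" >&2
        return 2
    fi
    ansible-playbook "$playbook" -l "$limit" --list-hosts || return 1  # who is targeted?
    ansible-playbook "$playbook" -l "$limit" --check || return 1       # dry run
    ansible-playbook "$playbook" -l "$limit"                           # real run
}

# Example: run_limited packages.yml saitama
```

The --list-hosts step only prints the hosts the limit matches, so a typo in a group name shows up before anything changes.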
Why Local Admin Accounts?
Notice we use localdavidbau or localarnab, not our LDAP accounts. Why?
The ldapclient.yml playbook configures LDAP authentication itself. If we used LDAP accounts to run it and something broke, we'd lock ourselves out! Local accounts work even when LDAP is down.
Playbook Reference
main.yml - The Master Playbook
main.yml runs all the other playbooks in order:
- import_playbook: ldapclient.yml
- import_playbook: localadmin.yml
- import_playbook: autofs.yml
- import_playbook: packages.yml
- import_playbook: eternal.yml
- import_playbook: webserver.yml
- import_playbook: cuda.yml
- import_playbook: nvidiadocker.yml
- import_playbook: miniconda.yml
This is typically what you run to fully configure a new machine.
Playbooks for All Hosts
| Playbook | What It Does | When to Run |
|---|---|---|
| ldapclient.yml | Configures SSSD for LDAP auth | New machine, or LDAP config changed |
| localadmin.yml | Creates localdavidbau/localarnab accounts | New machine, or adding admin |
| autofs.yml | Configures NFS automounting | New machine, or NFS config changed |
| packages.yml | Installs standard packages | New machine, or adding packages |
| eternal.yml | Installs Eternal Terminal | New machine |
GPU-Only Playbooks
| Playbook | What It Does | When to Run |
|---|---|---|
| cuda.yml | Installs NVIDIA drivers + CUDA | New GPU machine, or driver update |
| nvidiadocker.yml | Installs Docker with GPU support | Needed for containerized ML |
| miniconda.yml | Installs Miniconda to /opt | New machine |
| driver.yml | Fixes common NVIDIA driver issues | When GPUs aren't working |
| gcloud.yml | Installs Google Cloud SDK | If needed for GCP |
| grubipv6.yml | Disables IPv6 via GRUB | Troubleshooting network issues |
| reboot.yml | Reboots the host | After driver updates |
Server-Only Playbooks
| Playbook | What It Does | Target |
|---|---|---|
| ldapserver.yml | Configures OpenLDAP server | baunames only |
| webserver.yml | Installs Apache, Let's Encrypt | bauserver only |
What Each Playbook Does (Details)
ldapclient.yml
This is the most important playbook - it makes LDAP authentication work.
What it configures:
- Installs packages: sssd, sssd-ldap, ldap-utils, libnss-sss, libpam-sss
- Copies the LDAP CA certificate to /usr/local/share/ca-certificates/
- Deploys /etc/sssd/sssd.conf - tells SSSD how to reach baunames
- Deploys /etc/ldap/ldap.conf - LDAP client settings
- Deploys /etc/nsswitch.conf - tells the system to use SSSD for users/groups
- Enables pam_mkhomedir - auto-creates home directories on first login
- Restarts SSSD
When things go wrong: If users can't log in, this is often the culprit. Check /etc/sssd/sssd.conf and systemctl status sssd.
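A quick first check, sketched below, is whether the machine can resolve an LDAP user at all. The check_user_resolves helper is hypothetical; it uses the standard id command, which looks users up through nsswitch and therefore through SSSD on a configured client. Cached entries can make it succeed even when the LDAP server is unreachable, so treat a pass as a hint rather than proof.

```shell
# Hypothetical helper: report whether the system can resolve a user.
# On a cluster machine, LDAP users resolve through SSSD, so a failure
# here for a known LDAP user points at SSSD or its configuration.
check_user_resolves() {
    if id "$1" >/dev/null 2>&1; then
        echo "resolved: $1"
    else
        echo "NOT resolved: $1"
        return 1
    fi
}

# Example: check_user_resolves yourname
# If that fails, look at the service and its recent log output:
#   systemctl status sssd
#   journalctl -u sssd
```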
localadmin.yml
Creates emergency local admin accounts.
What it does:
- Creates localdavidbau and localarnab users
- Adds them to sudo group
- Installs their SSH public keys (from /srv/ansible/pubkeys/)
- Configures passwordless sudo
Why this matters: These accounts work even when LDAP is completely broken. Essential for recovery.
autofs.yml
Sets up automatic NFS mounting.
What it deploys:
- /etc/auto.master - Tells autofs to manage /share
- /etc/auto.share - Maps /share/* to NFS servers
Result: When you cd /share/u/yourname, autofs automatically mounts the NFS share.
packages.yml
Installs standard software everyone expects.
Packages installed:
- Editors: vim, emacs
- Terminals: tmux, screen, mosh, zsh
- Network: openssh-server, curl, aria2
- Monitoring: htop, glances
- Development: git, python3-distutils
- Documentation: texlive-full, pandoc
- Utilities: jq, tree, ncdu, strace, tcpdump
Also removes: tracker and tracker-extract (file indexing that slows things down)
cuda.yml
Installs NVIDIA drivers and CUDA toolkit.
What it does:
- Fixes any broken apt installations
- Removes conflicting old drivers
- Runs ubuntu-drivers autoinstall
- Installs nvidia-cuda-toolkit
Important: After running this, you usually need to reboot:
ansible-playbook reboot.yml -l hostname
miniconda.yml
Installs a system-wide Miniconda.
Details:
- Version: 23.11.0 with Python 3.11
- Location: /opt/miniconda3
- Auto-update disabled
For users: After installation, initialize conda in your shell:
/opt/miniconda3/bin/conda init
Common Tasks
Adding a New Machine to the Cluster
- Get the machine on the network with a Khoury IP (contact IT)
- Create initial local admin (manually, since Ansible can't connect yet):
    sudo adduser localdavidbau
    sudo usermod -aG sudo localdavidbau
- Copy your SSH key from your laptop:
    ssh-copy-id -o ProxyJump=login.khoury.northeastern.edu localdavidbau@newmachine
- Add to inventory - Edit /etc/ansible/hosts on baunames:
    [gpus]
    ...
    newmachine
- Run Ansible:
    cd /srv/ansible
    ansible-playbook main.yml -l newmachine
- Reboot if GPU machine:
    ansible-playbook reboot.yml -l newmachine
Updating Packages on a Specific Machine
ssh baunames
cd /srv/ansible
ansible-playbook packages.yml -l hostname
Updating NVIDIA Drivers
ansible-playbook cuda.yml -l hostname
ansible-playbook reboot.yml -l hostname
After reboot, verify:
ssh hostname
nvidia-smi
Updating LDAP Client Config
If you need to change how machines connect to LDAP:
- Edit files in /srv/ansible/ldap-client/
- Run: ansible-playbook ldapclient.yml
Careful! Test on one machine first with -l hostname.
Rebooting Machines
# One machine
ansible-playbook reboot.yml -l saitama
# All GPU machines (⚠️ interrupts everyone!)
ansible-playbook reboot.yml -l gpus
Configuration Files
On baunames
| Path | Purpose |
|---|---|
| /srv/ansible/ | All 20.04 playbooks |
| /srv/ansible24/ | All 24.04 playbooks |
| /srv/ansible/ldap-client/ | SSSD/LDAP client config templates |
| /srv/ansible/slapd-config/ | LDAP server configs |
| /srv/ansible/pubkeys/ | SSH public keys for admin accounts |
| /etc/ansible/hosts | Inventory of all machines |
Files Deployed to Clients
| File | Source | Purpose |
|---|---|---|
| /etc/sssd/sssd.conf | ldap-client/sssd.conf | SSSD configuration |
| /etc/ldap/ldap.conf | ldap-client/ldap.conf | LDAP client settings |
| /etc/nsswitch.conf | ldap-client/nsswitch.conf | Name service switch |
| /etc/auto.master | autofs templates | Autofs master map |
| /etc/auto.share | autofs templates | NFS share mappings |
Troubleshooting
Playbook Fails
Read the error message! Ansible usually tells you exactly what went wrong.
Common issues:
- Host unreachable: Machine is down or SSH isn't working
- Permission denied: Your SSH key isn't on the target machine
- Package not found: Ubuntu version mismatch (use the right playbook directory)
Testing a Single Task
Add --start-at-task to skip ahead to a named task. The run starts at that task and continues through every task after it:
ansible-playbook packages.yml -l hostname --start-at-task="Install htop"
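The task name passed to --start-at-task has to match a name in the playbook. A small sketch for looking names up is shown here; find_task is a hypothetical helper built on the standard --list-tasks option, which prints the tasks without running them.

```shell
# Hypothetical helper: list the task names in a playbook and filter them
# case-insensitively, so the exact name can be passed to --start-at-task.
find_task() {
    playbook=$1
    pattern=$2
    ansible-playbook "$playbook" --list-tasks | grep -i -e "$pattern"
}

# Example: find_task packages.yml htop
```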
See What Would Change
Use --check for a dry run:
ansible-playbook packages.yml -l hostname --check
Verbose Output
Add -v, -vv, or -vvv for more detail:
ansible-playbook packages.yml -l hostname -vv
GitHub Backup
A backup of the ansible scripts is maintained at: https://github.com/thebaulab/khoury-ansible-scripts
This is useful if baunames is down and you need to set up a new control node.
Related Pages
- Server-Administration-Notes - Admin overview
- LDAP-Server-Administration - LDAP details
- NFS-Storage-Administration - NFS configuration
- Network-Architecture - Network overview