NFS Storage Administration - thebaulab/onramp GitHub Wiki
This page documents the NFS storage infrastructure for the baulab.us cluster.
Understanding Network Storage
The Problem: Sharing Files Across Machines
Each machine has its own local disk. Without network storage, files created on one machine aren't accessible from another.
The Solution: Network File System (NFS)
NFS is a protocol that lets one computer (the "server") share its disk space with other computers (the "clients"). The clients can read and write files as if they were local, but the data actually lives on the server.
┌─────────────────────────────────────────────────────────────────────────┐
│ NAS SERVER (baushare) │
│ │
│ 84TB of disk space containing: │
│ /volume1/u/davidbau/ <- David's files │
│ /volume1/u/arnab/ <- Arnab's files │
│ /volume1/u/yourname/ <- Your files │
│ /volume1/projects/demystify/ <- Shared project │
│ │
└────────────────────────────────────┬────────────────────────────────────┘
│
│ NFS (Network File System)
│
┌────────────────────────┼────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────────┐ ┌───────────────────┐ ┌───────────────────┐
│ tokyo │ │ karakuri │ │ osaka │
│ │ │ │ │ │
│ /share/u/yourname │ │ /share/u/yourname │ │ /share/u/yourname │
│ (same files!) │ │ (same files!) │ │ (same files!) │
└───────────────────┘ └───────────────────┘ └───────────────────┘
Result: Your home directory (/share/u/yourname) contains the same files no matter which machine you're on.
Why Synology NAS?
Our storage servers are Synology devices - these are specialized computers designed just for serving files. They're more reliable than a regular PC because:
- RAID storage (if one disk fails, data is safe)
- Hot-swappable drives (replace failed disks without downtime)
- Purpose-built for 24/7 operation
- Easy web-based administration
Storage Overview
We have two NAS servers:
| Server | IP | Capacity | Used | Purpose |
|---|---|---|---|---|
| baushare | 10.200.205.57 | 84TB | 98% | User homes, projects, backups |
| baunas24 | 10.200.204.169 | 98TB | 56% | Large datasets |
WARNING: baushare is nearly full! When storing large files (datasets, model checkpoints), use /share/datasets/ on baunas24 instead.
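Before writing a multi-gigabyte file, it is worth checking how full the target volume is. A small sketch of such a pre-flight check (the `pct_used` helper name is ours, not a cluster tool; assumes GNU `df`):

```shell
#!/usr/bin/env bash
# pct_used PATH: print the percent-full figure for the filesystem holding PATH.
pct_used() {
  df --output=pcent "$1" | tail -1 | tr -dc '0-9'
}

# On the cluster you might run:
#   pct_used /share/datasets
pct_used /tmp
```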
Directory Structure
Here's how the storage is organized:
/share/ <- Autofs mount point (see below)
│
├── u/ <- User home directories (on baushare)
│ ├── davidbau/
│ ├── arnab/
│ ├── yourname/ <- Your home directory
│ └── ... (30+ users)
│
├── projects/ <- Shared project directories (on baushare)
│ ├── demystify/ <- Each project gets a folder
│ ├── iclgrok/
│ ├── engine/
│ └── ...
│
├── backup/ <- System backups (on baushare)
│ ├── ansible/
│ ├── config/
│ └── letsencrypt/
│
└── datasets/ <- Large datasets (on baunas24)
└── ...
Where Should I Put My Files?
| Type of Data | Location | Why |
|---|---|---|
| Code, configs, small files | /share/u/yourname/ | Your home dir, accessible everywhere |
| Shared project files | /share/projects/projectname/ | Others can access too |
| Large datasets (>10GB) | /share/datasets/ | baunas24 has space; baushare is full |
| Temporary/scratch files | /disk/ (local disk) | Much faster, not shared |
| Conda environments | Consider /disk/ | Can be slow over NFS |
How Autofs Works (The Magic Behind /share)
You might wonder: "How does /share/u/davidbau know to connect to baushare?" The answer is autofs.
The Problem with Regular Mounts
You could manually mount NFS shares like this:
sudo mount -t nfs baushare:/volume1/u /share/u
But there are problems:
- You'd need to do this on every machine
- If the NAS is temporarily unreachable, programs hang
- All shares are mounted all the time, even if unused
The Solution: Automounting
Autofs is a service that mounts shares on demand - when you try to access them:
$ cd /share/u/davidbau
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ 1. Kernel: "Someone is trying to access /share/u/davidbau" │
│ │
│ 2. Autofs daemon wakes up, reads /etc/auto.share: │
│ "* -rw,hard,bg,timeo=30 baushare:/volume1/&" │
│ │
│ 3. Autofs mounts: baushare:/volume1/u → /share/u │
│ │
│ 4. Now /share/u/davidbau is accessible! │
│ │
│ 5. After 5 minutes of no access, autofs unmounts automatically │
└─────────────────────────────────────────────────────────────────────────┘
The Configuration Files
/etc/auto.master - Tells autofs what to manage:
/share /etc/auto.share
This means: "For anything under /share, look at /etc/auto.share for instructions."
/etc/auto.share - The actual mount rules:
* -rw,hard,bg,timeo=30 baushare.khoury.northeastern.edu:/volume1/&
datasets -rw,hard,bg,timeo=30 baunas24.khoury.northeastern.edu:/volume1/datasets
What this means:
- `*` - matches any directory name (like `u`, `projects`, `backup`)
- `-rw,hard,bg,timeo=30` - mount options: read-write; `hard` keeps retrying instead of failing if the NAS is slow; `bg` retries in the background if the first attempt fails; `timeo=30` waits 3 seconds (30 tenths of a second) before retransmitting a request
- `baushare:/volume1/&` - the `&` is replaced with the directory name that matched `*`
So when you access /share/u, autofs mounts baushare:/volume1/u. When you access /share/projects, it mounts baushare:/volume1/projects.
The datasets line is special - it overrides the * rule to point to a different server (baunas24).
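To make the wildcard rule concrete, here is a tiny bash sketch of the substitution autofs performs. This is purely an illustration of the `&` rule, not autofs code; `map_location` is a made-up name:

```shell
#!/usr/bin/env bash
# Illustrate the auto.share wildcard rule: the directory name that matched *
# is substituted for & in the location field.
map_location() {
  local key="$1"
  local template="baushare.khoury.northeastern.edu:/volume1/&"
  printf '%s\n' "${template/&/$key}"   # bash pattern substitution
}

map_location u          # -> baushare.khoury.northeastern.edu:/volume1/u
map_location projects   # -> baushare.khoury.northeastern.edu:/volume1/projects
```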
Performance Considerations
NFS is Slower Than Local Disk
NFS has to send data over the network. For small files or many file operations, this adds up:
| Operation | Local Disk (/disk) | NFS (/share) |
|---|---|---|
| Read large file | ~500 MB/s | ~100 MB/s |
| Read 1000 small files | 0.1 seconds | 2-5 seconds |
| Random access | Very fast | Much slower |
Rule of thumb: If you're doing heavy I/O (training models, processing datasets), copy data to local disk first:
# Copy dataset to local disk
cp -r /share/datasets/imagenet /disk/imagenet
# Use local copy for training
python train.py --data /disk/imagenet
Remote Machines Have More Latency
Machines at MGHPCC (nagoya, hakone) are physically far from the NAS servers at Khoury:
| Machine | Location | Latency to NAS |
|---|---|---|
| tokyo, karakuri, etc. | Khoury 177 | ~0.5 ms |
| nagoya, hakone | Holyoke, MA | ~5-10 ms |
For these remote machines, definitely use local /disk storage:
# On nagoya/hakone - set up local workspace
mkdir -p /disk/u/$(whoami)
# Optionally redirect home to local disk
export HOME=/disk/u/$(whoami)
# Move conda to local disk (huge speedup)
mkdir -p /disk/u/$(whoami)/dotconda
ln -sf /disk/u/$(whoami)/dotconda ~/.conda
Checking Storage Status
View Current Mounts
# See what's mounted
mount | grep share
# Example output:
# baushare:/volume1/u on /share/u type nfs4 (rw,...)
# baushare:/volume1/projects on /share/projects type nfs4 (rw,...)
Check Disk Usage
# How much space is used?
df -h /share/u /share/projects /share/datasets
# Example output:
# Filesystem Size Used Avail Use% Mounted on
# baushare:/volume1/u 84T 82T 2.0T 98% /share/u
# baunas24:/volume1/datasets 98T 55T 43T 56% /share/datasets
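With baushare at 98%, it also helps to find what is taking space in your own directory. A small helper sketch (`usage_report` is our name; relies on GNU `sort -h`):

```shell
#!/usr/bin/env bash
# usage_report DIR: list the immediate children of DIR by size, largest last.
usage_report() {
  du -sh "$1"/* 2>/dev/null | sort -h
}

# On the cluster:
#   usage_report /share/u/$(whoami)
```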
Check Autofs Status
# Is autofs running?
systemctl status autofs
# Restart if needed
sudo systemctl restart autofs
List What the NAS Exports
# See what shares are available
showmount -e baushare.khoury.northeastern.edu
# Example output:
# Export list for baushare.khoury.northeastern.edu:
# /volume1/backup 10.200.0.0/16
# /volume1/projects 10.200.0.0/16
# /volume1/u 10.200.0.0/16
NAS Administration
Accessing the Web Interface
Both NAS servers have a web-based admin interface (Synology DSM):
| Server | URL |
|---|---|
| baushare | https://baushare.khoury.northeastern.edu:5001 |
| baunas24 | https://baunas24.khoury.northeastern.edu:5001 |
You'll need admin credentials (ask David or Arnab).
Common Admin Tasks
Check storage health:
- Log into DSM
- Click "Storage Manager"
- Check for any warning icons on volumes or drives
Manage NFS exports:
- Control Panel → Shared Folder
- Click on folder → Edit
- NFS Permissions tab
View connected clients:
- Control Panel → File Services
- Service Status tab shows active NFS connections
Making Project Directories Group-Writable
By default, files you create are only writable by you. For shared projects:
# Make directory writable by research group
chmod -R g+rwX /share/projects/myproject
# Ensure new files inherit group permissions
chmod g+s /share/projects/myproject
The g+s (setgid) flag means new files in that directory will belong to the same group as the directory.
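You can watch setgid group inheritance work in any throwaway directory; a minimal local demonstration (assumes GNU `stat` on Linux):

```shell
#!/usr/bin/env bash
# Demonstrate setgid group inheritance in a scratch directory.
demo=$(mktemp -d)
chmod g+s "$demo"                          # set the setgid bit on the directory
dirgroup=$(stat -c '%G' "$demo")           # group owning the directory
touch "$demo/newfile"
filegroup=$(stat -c '%G' "$demo/newfile")  # group of the newly created file
echo "dir group: $dirgroup, new file group: $filegroup"   # the two match
```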
Troubleshooting
"Stale file handle" Error
This happens when a file handle the client has cached no longer matches what the server has - for example, after an export was changed or the NAS rebooted:
# Force autofs to remount
sudo systemctl restart autofs
# Or manually unmount and let autofs remount on next access
sudo umount -l /share/u
cd /share/u # This triggers remount
Mount Not Working
Step 1: Check autofs
sudo systemctl status autofs
sudo systemctl restart autofs
Step 2: Test network connectivity
ping baushare.khoury.northeastern.edu
Step 3: Test NFS availability
showmount -e baushare.khoury.northeastern.edu
Step 4: Try manual mount (for debugging)
sudo mount -t nfs baushare.khoury.northeastern.edu:/volume1/u /mnt
ls /mnt
sudo umount /mnt
Permission Denied
Causes and solutions:
- UID mismatch: Your local uid doesn't match your LDAP uid. Check: id yourusername
- Not in group: Check: groups yourusername
- NFS export restrictions: Admin needs to check NAS settings
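A quick way to compare the two views of your account (assumes the host resolves accounts through the user database, e.g. LDAP, via getent):

```shell
#!/usr/bin/env bash
# Compare the uid your processes actually run with against the uid
# recorded in the user database for your account.
local_uid=$(id -u)
db_uid=$(getent passwd "$(id -un)" | cut -d: -f3)
echo "local uid: $local_uid, database uid: $db_uid"
# A mismatch between these two is the classic cause of NFS "Permission denied".
```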
Slow Performance
- Are you on a remote machine (nagoya/hakone)? Use local /disk
- Many small files? Copy to local disk first
- Check network: ping baushare - latency should be <1ms locally
Backup Information
System backups are stored on /share/backup/:
| Directory | Contents | Frequency |
|---|---|---|
| /share/backup/ansible/ | Ansible playbook backups | Manual |
| /share/backup/config/ | LDAP config, system configs | Weekly |
| /share/backup/letsencrypt/ | SSL certificates | After renewal |
Note: User home directories are NOT automatically backed up. Important data should be committed to git or copied elsewhere.
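Since homes are not backed up, committing work regularly is the main safety net. A minimal sketch (`backup_commit` is our name; pushing assumes your repo already has a remote configured):

```shell
#!/usr/bin/env bash
# backup_commit: snapshot everything in the current git repository.
backup_commit() {
  git add -A
  git commit -m "backup $(date -u +%Y-%m-%dT%H:%MZ)" || true  # ok if nothing changed
}

# Then push to a remote you trust, e.g.:
#   git push origin main
```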
Related Pages
- Filesystems-on-bau-clusters - User guide to filesystems
- Network-Architecture - Network overview
- Using-Nagoya - Tips for remote machines
- Ansible-Playbook-Reference - How autofs is configured via Ansible