NFS Storage Administration - thebaulab/onramp GitHub Wiki


This page documents the NFS storage infrastructure for the baulab.us cluster.

Understanding Network Storage

The Problem: Sharing Files Across Machines

Each machine has its own local disk. Without network storage, files created on one machine aren't accessible from another.

The Solution: Network File System (NFS)

NFS is a protocol that lets one computer (the "server") share its disk space with other computers (the "clients"). The clients can read and write files as if they were local, but the data actually lives on the server.

┌─────────────────────────────────────────────────────────────────────────┐
│                    NAS SERVER (baushare)                                │
│                                                                         │
│   84TB of disk space containing:                                        │
│   /volume1/u/davidbau/          <- David's files                        │
│   /volume1/u/arnab/             <- Arnab's files                        │
│   /volume1/u/yourname/          <- Your files                           │
│   /volume1/projects/demystify/  <- Shared project                       │
│                                                                         │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     │ NFS (Network File System)
                                     │
            ┌────────────────────────┼────────────────────────────────┐
            │                        │                                │
            ▼                        ▼                                ▼
┌───────────────────┐    ┌───────────────────┐    ┌───────────────────┐
│      tokyo        │    │     karakuri      │    │      osaka        │
│                   │    │                   │    │                   │
│ /share/u/yourname │    │ /share/u/yourname │    │ /share/u/yourname │
│ (same files!)     │    │ (same files!)     │    │ (same files!)     │
└───────────────────┘    └───────────────────┘    └───────────────────┘

Result: Your home directory (/share/u/yourname) contains the same files no matter which machine you're on.
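A quick way to see this in action (hypothetical session; assumes you have accounts on two of the machines, e.g. tokyo and karakuri):

```shell
# On tokyo: create a file in your NFS home directory
echo "hello from tokyo" > /share/u/$(whoami)/nfs-test.txt

# On karakuri: the same file is already there, no copying needed
cat /share/u/$(whoami)/nfs-test.txt
```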

Why Synology NAS?

Our storage servers are Synology devices - these are specialized computers designed just for serving files. They're more reliable than a regular PC because:

  • RAID storage (if one disk fails, data is safe)
  • Hot-swappable drives (replace failed disks without downtime)
  • Purpose-built for 24/7 operation
  • Easy web-based administration

Storage Overview

We have two NAS servers:

| Server   | IP             | Capacity | Used | Purpose                       |
|----------|----------------|----------|------|-------------------------------|
| baushare | 10.200.205.57  | 84TB     | 98%  | User homes, projects, backups |
| baunas24 | 10.200.204.169 | 98TB     | 56%  | Large datasets                |

WARNING: baushare is nearly full! When storing large files (datasets, model checkpoints), use /share/datasets/ on baunas24 instead.

Directory Structure

Here's how the storage is organized:

/share/                          <- Autofs mount point (see below)
│
├── u/                           <- User home directories (on baushare)
│   ├── davidbau/
│   ├── arnab/
│   ├── yourname/                <- Your home directory
│   └── ... (30+ users)
│
├── projects/                    <- Shared project directories (on baushare)
│   ├── demystify/               <- Each project gets a folder
│   ├── iclgrok/
│   ├── engine/
│   └── ...
│
├── backup/                      <- System backups (on baushare)
│   ├── ansible/
│   ├── config/
│   └── letsencrypt/
│
└── datasets/                    <- Large datasets (on baunas24)
    └── ...

Where Should I Put My Files?

| Type of Data               | Location                     | Why                                  |
|----------------------------|------------------------------|--------------------------------------|
| Code, configs, small files | /share/u/yourname/           | Your home dir, accessible everywhere |
| Shared project files       | /share/projects/projectname/ | Others can access too                |
| Large datasets (>10GB)     | /share/datasets/             | baunas24 has space; baushare is full |
| Temporary/scratch files    | /disk/ (local disk)          | Much faster, not shared              |
| Conda environments         | Consider /disk/              | Can be slow over NFS                 |

How Autofs Works (The Magic Behind /share)

You might wonder: "How does /share/u/davidbau know to connect to baushare?" The answer is autofs.

The Problem with Regular Mounts

You could manually mount NFS shares like this:

sudo mount -t nfs baushare:/volume1/u /share/u

But there are problems:

  • You'd need to do this on every machine
  • If the NAS is temporarily unreachable, programs hang
  • All shares are mounted all the time, even if unused

The Solution: Automounting

Autofs is a service that mounts shares on demand - when you try to access them:

$ cd /share/u/davidbau
         │
         ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  1. Kernel: "Someone is trying to access /share/u/davidbau"             │
│                                                                         │
│  2. Autofs daemon wakes up, reads /etc/auto.share:                      │
│     "*  -rw,hard,bg,timeo=30  baushare:/volume1/&"                      │
│                                                                         │
│  3. Autofs mounts: baushare:/volume1/u → /share/u                       │
│                                                                         │
│  4. Now /share/u/davidbau is accessible!                                │
│                                                                         │
│  5. After 5 minutes of no access, autofs unmounts automatically         │
└─────────────────────────────────────────────────────────────────────────┘

The Configuration Files

/etc/auto.master - Tells autofs what to manage:

/share    /etc/auto.share

This means: "For anything under /share, look at /etc/auto.share for instructions."

/etc/auto.share - The actual mount rules:

*          -rw,hard,bg,timeo=30    baushare.khoury.northeastern.edu:/volume1/&
datasets   -rw,hard,bg,timeo=30    baunas24.khoury.northeastern.edu:/volume1/datasets

What this means:

  • * - Matches any directory name (like u, projects, backup)
  • -rw,hard,bg,timeo=30 - Mount options: read-write; hard means retry indefinitely instead of returning an error if the NAS is slow; bg retries a failed initial mount in the background; timeo=30 sets a 3-second RPC timeout (timeo is in tenths of a second)
  • baushare:/volume1/& - The & is replaced with the matched directory name

So when you access /share/u, autofs mounts baushare:/volume1/u. When you access /share/projects, it mounts baushare:/volume1/projects.

The datasets line is special - it overrides the * rule to point to a different server (baunas24).
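You can watch the wildcard rule fire for any top-level name (hypothetical session; backup is just one example):

```shell
# First access triggers autofs: & expands to "backup", so it mounts
# baushare:/volume1/backup at /share/backup
ls /share/backup

# The new NFS mount now appears in the mount table
mount | grep /share/backup
```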

Performance Considerations

NFS is Slower Than Local Disk

NFS has to send data over the network. For small files or many file operations, this adds up:

| Operation             | Local Disk (/disk) | NFS (/share) |
|-----------------------|--------------------|--------------|
| Read large file       | ~500 MB/s          | ~100 MB/s    |
| Read 1000 small files | 0.1 seconds        | 2-5 seconds  |
| Random access         | Very fast          | Much slower  |
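A rough way to measure the throughput difference yourself with dd (paths are illustrative; write somewhere you own, and note that page-cache effects can flatter repeated reads):

```shell
# Write a 1 GB test file to each filesystem; dd reports throughput when done.
# conv=fsync forces the data to actually hit the disk/network before timing ends.
dd if=/dev/zero of=/disk/ddtest bs=1M count=1024 conv=fsync
dd if=/dev/zero of=/share/u/$(whoami)/ddtest bs=1M count=1024 conv=fsync

# Clean up the test files
rm /disk/ddtest /share/u/$(whoami)/ddtest
```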

Rule of thumb: If you're doing heavy I/O (training models, processing datasets), copy data to local disk first:

# Copy dataset to local disk
cp -r /share/datasets/imagenet /disk/imagenet

# Use local copy for training
python train.py --data /disk/imagenet
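For large copies, rsync is a reasonable alternative to cp: it can resume after an interruption and skips files that are already present (-a preserves permissions and timestamps; --info=progress2 needs rsync 3.1 or newer):

```shell
# Resumable copy; re-running only transfers what's missing
rsync -a --info=progress2 /share/datasets/imagenet/ /disk/imagenet/
```

Note the trailing slashes: with them, rsync copies the directory's contents into /disk/imagenet/ rather than nesting an imagenet/imagenet directory.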

Remote Machines Have More Latency

Machines at MGHPCC (nagoya, hakone) are physically far from the NAS servers at Khoury:

| Machine               | Location    | Latency to NAS |
|-----------------------|-------------|----------------|
| tokyo, karakuri, etc. | Khoury 177  | ~0.5 ms        |
| nagoya, hakone        | Holyoke, MA | ~5-10 ms       |

For these remote machines, definitely use local /disk storage:

# On nagoya/hakone - set up local workspace
mkdir -p /disk/u/$(whoami)

# Optionally redirect home to local disk
export HOME=/disk/u/$(whoami)

# Move conda to local disk (huge speedup)
mkdir -p /disk/u/$(whoami)/dotconda
ln -sf /disk/u/$(whoami)/dotconda ~/.conda
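To verify the conda redirect took effect (a quick sanity check; the exact path depends on your username):

```shell
# Should print /disk/u/<yourname>/dotconda
readlink ~/.conda

# Should show a local /disk filesystem, not an NFS mount
df -h ~/.conda/
```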

Checking Storage Status

View Current Mounts

# See what's mounted
mount | grep share

# Example output:
# baushare:/volume1/u on /share/u type nfs4 (rw,...)
# baushare:/volume1/projects on /share/projects type nfs4 (rw,...)

Check Disk Usage

# How much space is used?
df -h /share/u /share/projects /share/datasets

# Example output:
# Filesystem                           Size  Used Avail Use% Mounted on
# baushare:/volume1/u                   84T   82T  2.0T  98% /share/u
# baunas24:/volume1/datasets            98T   55T   43T  56% /share/datasets

Check Autofs Status

# Is autofs running?
systemctl status autofs

# Restart if needed
sudo systemctl restart autofs

List What the NAS Exports

# See what shares are available
showmount -e baushare.khoury.northeastern.edu

# Example output:
# Export list for baushare.khoury.northeastern.edu:
# /volume1/backup   10.200.0.0/16
# /volume1/projects 10.200.0.0/16
# /volume1/u        10.200.0.0/16

NAS Administration

Accessing the Web Interface

Both NAS servers have a web-based admin interface (Synology DSM):

| Server   | URL                                           |
|----------|-----------------------------------------------|
| baushare | https://baushare.khoury.northeastern.edu:5001 |
| baunas24 | https://baunas24.khoury.northeastern.edu:5001 |

You'll need admin credentials (ask David or Arnab).

Common Admin Tasks

Check storage health:

  1. Log into DSM
  2. Click "Storage Manager"
  3. Check for any warning icons on volumes or drives

Manage NFS exports:

  1. Control Panel → Shared Folder
  2. Click on folder → Edit
  3. NFS Permissions tab

View connected clients:

  1. Control Panel → File Services
  2. Service Status tab shows active NFS connections

Making Project Directories Group-Writable

By default, files you create are only writable by you. For shared projects:

# Make directory writable by research group
chmod -R g+rwX /share/projects/myproject

# Ensure new files inherit group permissions
chmod g+s /share/projects/myproject

The g+s (setgid) flag means new files and subdirectories created there inherit the directory's group (subdirectories also inherit the setgid bit), rather than your primary group.
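The effect is easy to demonstrate locally (a sketch using a temp directory; here the two groups happen to coincide anyway, but the setgid bit is what guarantees it when the directory's group differs from your primary group, as with /share/projects/):

```shell
# Create a directory, set the setgid bit, and create a file inside it:
# the file's group matches the directory's group, not the creator's
# primary group.
d=$(mktemp -d)
chmod g+s "$d"
touch "$d/newfile"
stat -c '%G' "$d" "$d/newfile"   # prints the same group name twice
```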

Troubleshooting

"Stale file handle" Error

This happens when the client holds a handle for a file or export that has changed on the server side (for example, the export was re-created or files were replaced), leaving the cached mount out of sync:

# Force autofs to remount
sudo systemctl restart autofs

# Or manually unmount and let autofs remount on next access
sudo umount -l /share/u
cd /share/u   # This triggers remount

Mount Not Working

Step 1: Check autofs

sudo systemctl status autofs
sudo systemctl restart autofs

Step 2: Test network connectivity

ping baushare.khoury.northeastern.edu

Step 3: Test NFS availability

showmount -e baushare.khoury.northeastern.edu

Step 4: Try manual mount (for debugging)

sudo mount -t nfs baushare.khoury.northeastern.edu:/volume1/u /mnt
ls /mnt
sudo umount /mnt

Permission Denied

Causes and solutions:

  • UID mismatch: Your local uid doesn't match your LDAP uid. Check: id yourusername
  • Not in group: Check: groups yourusername
  • NFS export restrictions: Admin needs to check NAS settings
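To compare the two uids (getent queries the same NSS/LDAP sources that NFS permission checks rely on):

```shell
# The uid your processes actually run with
id -u

# The directory entry for your account; the third colon-separated
# field is the uid the NFS server expects
getent passwd $(whoami)
```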

Slow Performance

  • Are you on a remote machine (nagoya/hakone)? Use local /disk
  • Many small files? Copy to local disk first
  • Check network: ping baushare - latency should be <1ms locally

Backup Information

System backups are stored on /share/backup/:

| Directory                  | Contents                    | Frequency     |
|----------------------------|-----------------------------|---------------|
| /share/backup/ansible/     | Ansible playbook backups    | Manual        |
| /share/backup/config/      | LDAP config, system configs | Weekly        |
| /share/backup/letsencrypt/ | SSL certificates            | After renewal |

Note: User home directories are NOT automatically backed up. Important data should be committed to git or copied elsewhere.

Related Pages