Storage - eunki-7/llm-rdma-mlops-lab GitHub Wiki
Storage
Shared storage for models, datasets, and training outputs across all nodes.
⚙️ Setup
- node0: acts as the NFS server
- node1 ~ node3: act as NFS clients
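A minimal sketch of how this setup might be wired up. It assumes that node0 exports the three shared directories, that the hostname `node0` resolves from the clients, and that all nodes sit on a hypothetical `10.0.0.0/24` subnet. The export options shown are common defaults, not the lab's confirmed configuration. The script only prints the lines, so it can be reviewed before anything is applied as root.

```shell
#!/bin/sh
# Assumed values (not from the original page): adjust to the actual cluster.
SUBNET="10.0.0.0/24"              # hypothetical client subnet
SHARES="/models /data /outputs"   # shared directories exported by node0

# Print one /etc/exports entry per shared directory (intended for node0).
print_exports() {
    for dir in $SHARES; do
        printf '%s %s(rw,sync,no_subtree_check)\n' "$dir" "$SUBNET"
    done
}

# Print the mount commands a client (node1 to node3) would run as root.
print_client_mounts() {
    for dir in $SHARES; do
        printf 'mount -t nfs node0:%s %s\n' "$dir" "$dir"
    done
}

print_exports
print_client_mounts
```

On node0, the printed export entries would go into `/etc/exports`, followed by `exportfs -ra` to reload them. On each client, the mount points must exist before the `mount` commands are run.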
Directories
/models
→ HuggingFace model cache, checkpoints
/data
→ datasets (JSONL, etc.)
/outputs
→ training outputs and logs
Optional NVMe Caching
- Sync frequently used models to local NVMe storage for faster load times:
# Trailing slashes copy the directory contents instead of nesting hf_cache/hf_cache
rsync -av /models/hf_cache/ /local_nvme/models/hf_cache/
export HF_HOME=/local_nvme/models/hf_cache
export TRANSFORMERS_CACHE=/local_nvme/models/hf_cache
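The caching step above could be guarded so that jobs fall back to the shared NFS cache on nodes where the local NVMe copy has not been created yet. The following is a sketch under that assumption; `pick_hf_cache` is a hypothetical helper introduced here for illustration, not part of any library or of the lab's scripts.

```shell
#!/bin/sh
# Echo the local NVMe cache path if that directory exists,
# otherwise echo the shared NFS path.
# pick_hf_cache is a hypothetical helper name.
pick_hf_cache() {
    local_dir="$1"
    nfs_dir="$2"
    if [ -d "$local_dir" ]; then
        printf '%s\n' "$local_dir"
    else
        printf '%s\n' "$nfs_dir"
    fi
}

# Point HuggingFace tooling at whichever cache is available on this node.
HF_HOME="$(pick_hf_cache /local_nvme/models/hf_cache /models/hf_cache)"
export HF_HOME
```

Sourcing this from a job launcher would keep the same launch command usable on every node, whether or not the rsync step has been run there.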