Storage - eunki-7/llm-rdma-mlops-lab GitHub Wiki

Storage

Shared storage for models, datasets, and training outputs across all nodes.


⚙️ Setup

  • node0: acts as the NFS server
  • node1 through node3: act as NFS clients

Directories

  • /models → Hugging Face model cache and checkpoints
  • /data → datasets (JSONL, etc.)
  • /outputs → training outputs and logs
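
The layout above could be wired up roughly as follows. This is a minimal sketch, not the lab's actual configuration: the client subnet `10.0.0.0/24`, the export and mount options, and the helper names `gen_exports` and `gen_fstab` are all assumptions made for illustration. The script only prints configuration lines, so it is safe to run without root.

```shell
#!/bin/sh
# Assumed values (not from the wiki): adjust to match your cluster.
NFS_SERVER="node0"
CLIENT_SUBNET="10.0.0.0/24"   # hypothetical subnet containing node1 through node3

# Print one /etc/exports line per shared directory (for the NFS server).
gen_exports() {
  for dir in /models /data /outputs; do
    printf '%s %s(rw,sync,no_subtree_check)\n' "$dir" "$CLIENT_SUBNET"
  done
}

# Print one /etc/fstab line per shared directory (for each NFS client).
gen_fstab() {
  for dir in /models /data /outputs; do
    printf '%s:%s %s nfs defaults,_netdev 0 0\n' "$NFS_SERVER" "$dir" "$dir"
  done
}

echo "# /etc/exports on $NFS_SERVER"
gen_exports
echo "# /etc/fstab on each client"
gen_fstab
```

On node0, the export lines would typically be appended to /etc/exports and activated with `sudo exportfs -ra`. On each client, the fstab lines would be appended to /etc/fstab after creating the mount points, then mounted with `sudo mount -a`.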

Optional NVMe Caching

  • Sync frequently used models to local NVMe storage for faster load times, then point the Hugging Face cache variables at the local copy:

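    # Trailing slashes copy the directory contents instead of nesting hf_cache/hf_cache
    rsync -av /models/hf_cache/ /local_nvme/models/hf_cache/
    export HF_HOME=/local_nvme/models/hf_cache
    # Deprecated in newer transformers releases in favor of HF_HOME; kept for older versions
    export TRANSFORMERS_CACHE=/local_nvme/models/hf_cache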
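
Because the NVMe cache is optional, a node without a local copy may still need to read from the NFS share. The sketch below shows one way to choose between the two at job start. The helper name `pick_hf_cache` is hypothetical, and the fallback behavior is an assumption rather than something the wiki describes.

```shell
#!/bin/sh
# Print the cache directory to use: the local NVMe copy when that directory
# exists, otherwise the shared NFS copy.
pick_hf_cache() {
  local_cache="$1"
  shared_cache="$2"
  if [ -d "$local_cache" ]; then
    printf '%s\n' "$local_cache"
  else
    printf '%s\n' "$shared_cache"
  fi
}

# Paths are the ones used in the commands above.
HF_HOME="$(pick_hf_cache /local_nvme/models/hf_cache /models/hf_cache)"
export HF_HOME
echo "HF_HOME=$HF_HOME"
```

This only checks that the local directory exists, not that its contents are up to date, so the `rsync` step would still need to run whenever new models land on the share.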
    

🖼️ Storage Diagram