【参考例子】安装训练环境和记录测试结果 - bettermorn/IAICourse GitHub Wiki

1 服务器环境

功能 参数/工具名称 版本号
操作系统 CentOS 7.7 内核linux mstation 3.10.0-1062.18.1.el7.x86 64
CPU 12vCPU -
内存 128GB -
GPU 2卡 NVIDIA GP100GL Quadro -
  • 查看 硬盘信息确定安装路径
df -h
fdisk -l

访问方法

  • SSH: putty
  • SFTP:fileZilla

2 安装基础应用

2.1 运行环境安装参考

Python 3.8

下载链接:https://www.python.org/downloads/release/python-3811/ 安装方法:https://packaging.python.org/guides/installing-using-linux-tools/

Anaconda

https://docs.anaconda.com/anaconda/install/linux/ 可在conda上建立多个环境,以便用于不同的模型训练

NVIDIA CUDA-enabled

CUDA Installation Guide for Linux 参考 https://docs.nvidia.com/cuda/archive/10.2/cuda-installation-guide-linux/index.html

cuDNN

用于深层神经网络的gpu加速库 下载: https://developer.nvidia.com/rdp/cudnn-download 安装:https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html

wget https://developer.nvidia.com/compute/machine-learning/cudnn/secure/8.2.4/10.2_20210831/cudnn-10.2-linux-x64-v8.2.4.15.tgz
tar -xzvf cudnn-10.2-linux-x64-v8.2.4.15.tgz

2.2 安装过程命令

  • Conda 2021.05
  • Python 3.8.11
  • CUDA 10.2
  • KB:custom install path only for shell file:
sudo sh cuda_10.2.89_440.33.01_linux.run  --silent \
                --toolkit --toolkitpath=/home/hpc/cuda/toolkit \
                --samples --samplespath=/home/hpc/cuda/samples \ 
                --defaultroot=/home/hpc/cuda --tmpdir=/home/hpc/cuda/tmp --installpath=/home/hpc/install  --extract=/home/hpc/extract   --no-man-page   --kernel-source-path=/home/hpc/kernel --librarypath=/home/hpc/lib 

refer to:

  1. How do I install the Toolkit in a different location? https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#faq1
  2. how can I install CUDA on linux on a custom path (not in /usr/local) https://superuser.com/questions/1131813/how-can-i-install-cuda-on-linux-on-a-custom-path-not-in-usr-local
  • cuDNN 8.1.1.33_CUDA11.2

3 路径设置

3.1 备用文件

  • /home/hpc/files

3.2 Conda 环境

  • tsai
  • cnnlstmdense

3.3 代码

  • /home/hpc/cnnlstmdense
  • data
  • models
  • outputs
  • /home/hpc/tsai
  • data
  • models
  • output

4 运行报告

编号 运行时长(分钟或者小时) epochs batch_size 模型和运行参数 评价指标:正确率%等 GPU cuDNN