Paul‐Pham‐Dev‐Diary‐2024‐04‐06 - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki
6 April 2024 - Saturday
I picked up an 850 W power supply from Best Buy for about $96, because online reviews I read said the NVIDIA RTX 3080 might need up to 750 W at peak consumption during machine learning training tasks.
I powered off the machine to install the GPU into the PCIe slot. I couldn't find a separate space to mount the Corsair supply, since the existing no-name brand power supply is already taking up the usual space. So it's just hanging out on the carpet next to the open ATX case, which is dodgy long-term but okay for now as long as I move carefully around it and don't let cats or balloons into the office.
An anti-static mat, or a cannibalized empty Dell ATX case from the Physical Computing Lab, would work better for a later step when we productionize this.
I followed this article on installing the latest NVIDIA drivers on Ubuntu.
Here is the command history showing how I detected which NVIDIA device was available, and which drivers and utilities I installed.
20014 nvidia-detector
20015 sudo nvidia-detector
20016 dmesg | grep NVIDIA
20017 sudo dmesg | grep NVIDIA
20018 sudo dmesg | grep nvidia
20019 cat /proc/driver/nvidia/version
20020 sudo ubuntu-drivers list --gpgpu
20021 sudo ubuntu-drivers install --gpgpu
20022 man ascii
20023 sudo ubuntu-drivers install --gpgpu
20024 sudo nvidia-detector
20025 sudo dmesg
20026 sudo dmesg | grep -i nvidia
20027 sudo reboot
20028 sudo nvidia-detector
20029 sudo ubuntu-drivers list --gpgpu
20030 sudo ubuntu-drivers install --gpgpu nvidia:550
20031 sudo ubuntu-drivers install --gpgpu nvidia:550-server
20032 sudo ubuntu-drivers install nvidia-utils-550-server
20033 sudo apt install nvidia-utils-550-server
20034 history | less
The detector utility told me which version of the hardware driver I needed (I think).
$ sudo nvidia-detector
nvidia-driver-550
After installing the drivers, I got kernel messages making it seem like the hardware was detected.
$ sudo dmesg | grep -i "nvidia"
[ 1.330136] nouveau 0000:01:00.0: NVIDIA GA102 (b72000a1)
[ 4.533535] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[ 4.533560] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input12
[ 4.533583] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input13
[ 4.533601] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input14
[ 4.533664] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input15
[ 4.533687] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input16
[ 4.533704] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input17
[ 4.735486] audit: type=1400 audit(1712441626.747:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1081 comm="apparmor_parser"
[ 4.735506] audit: type=1400 audit(1712441626.747:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1081 comm="apparmor_parser"
[ 15.163709] audit: type=1400 audit(1712441637.175:137): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=1932 comm="apparmor_parser"
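The first dmesg line above is logged by nouveau, the open-source driver, rather than by the proprietary nvidia kernel module. So these messages show that the card is visible on the PCIe bus, but not necessarily that the newly installed driver is in use yet. Below is a minimal sketch for pulling that detail out of dmesg text, assuming lines shaped like the one above. The function name and the regular expression are illustrative, not part of any NVIDIA tooling.

```python
import re

# Matches dmesg lines of the form "[timestamp] <driver> <pci address>: <message>", e.g.
#   [ 1.330136] nouveau 0000:01:00.0: NVIDIA GA102 (b72000a1)
PCI_LINE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+"
    r"(?P<driver>[\w-]+)\s+"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):\s+"
    r"(?P<message>.*)$"
)


def drivers_by_pci_address(dmesg_text):
    """Return {pci_address: set of driver names that logged about that device}."""
    found = {}
    for line in dmesg_text.splitlines():
        match = PCI_LINE.match(line.strip())
        if match:
            found.setdefault(match["pci"], set()).add(match["driver"])
    return found


# Two lines copied from the kernel log above.
sample = """\
[ 1.330136] nouveau 0000:01:00.0: NVIDIA GA102 (b72000a1)
[ 4.533535] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
"""

print(drivers_by_pci_address(sample))
```

The audio "input:" lines do not match because they are not prefixed with a driver name and PCI address, so only the nouveau entry for 0000:01:00.0 is reported.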
However, nvidia-smi, a utility I've used in the past to see which GPUs are detectable for, e.g., PyTorch and general-purpose compute, shows nothing.
$ sudo nvidia-smi -a
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
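This failure can be narrowed down with a few cheap checks. The sketch below assumes a Linux host and reuses two things from the command history: the /proc/driver/nvidia/version file, which exists only while the proprietary kernel module is loaded, and the nvidia-smi binary itself. The function name and the dictionary keys are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def nvidia_driver_status():
    """Collect three quick signals about the state of the NVIDIA driver."""
    status = {
        # This file is only present while the proprietary kernel module is loaded.
        "kernel_module_loaded": Path("/proc/driver/nvidia/version").exists(),
        # Is the nvidia-smi utility installed and on PATH?
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # Does nvidia-smi manage to talk to the driver (exit code 0)?
        "nvidia_smi_ok": False,
    }
    if status["nvidia_smi_on_path"]:
        try:
            result = subprocess.run(
                ["nvidia-smi"], capture_output=True, text=True, timeout=30
            )
            status["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass  # treat "could not run" the same as "failed"
    return status


print(nvidia_driver_status())
```

If nvidia_smi_on_path is true but kernel_module_loaded is false, the utility is installed and the kernel module is not loaded, which is consistent with the error message above.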
Time to reboot.
GPU Detected
After rebooting, the previous nvidia-drivers command was no longer available, so I manually installed what seemed to be the correct driver from the apt package repositories:
sudo apt install nvidia-headless-550-server
And then at last nvidia-smi gives us some satisfaction:
==============NVSMI LOG==============
Timestamp : Sat Apr 6 18:17:21 2024
Driver Version : 550.54.14
CUDA Version : 12.4
Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3080
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : None
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-76f7cb5e-d58b-4a1e-4daf-5acd40b5431e
Minor Number : 0
VBIOS Version : 94.02.71.80.C3
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 2216-202-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.03.03
OEM Object : 2.0
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
GSP Firmware Version : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x0
Device Id : 0x221610DE
Bus Id : 00000000:01:00.0
Sub System Id : 0xC8901028
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 30 %
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 10240 MiB
Reserved : 237 MiB
Used : 0 MiB
Free : 10002 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 1 MiB
Free : 16383 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable Parity : N/A
SRAM Uncorrectable SEC-DED : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
SRAM Threshold Exceeded : N/A
Aggregate Uncorrectable SRAM Sources
SRAM L2 : N/A
SRAM SM : N/A
SRAM Microcontroller : N/A
SRAM PCIE : N/A
SRAM Other : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 36 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 93 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 76.30 W
Current Power Limit : 320.00 W
Requested Power Limit : 320.00 W
Default Power Limit : 320.00 W
Min Power Limit : 100.00 W
Max Power Limit : 320.00 W
GPU Memory Power Readings
Power Draw : N/A
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1710 MHz
SM : 1710 MHz
Memory : 9501 MHz
Video : 1500 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 9501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 856.250 mV
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Processes : None
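The full report is long. For scripting, nvidia-smi also has a CSV query mode (--query-gpu with --format=csv,noheader,nounits) that returns only the requested fields. Here is a minimal sketch of parsing that output. The field list is one possible selection, and the sample line is hand-built from the values in the log above, not captured output.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the order determines the CSV column order.
FIELDS = [
    "name",
    "driver_version",
    "memory.total",
    "memory.used",
    "temperature.gpu",
    "power.draw",
]


def parse_gpu_csv(text):
    """Turn 'csv,noheader,nounits' output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in reader if row]


def query_gpus():
    """Run nvidia-smi in query mode; raises if the tool is missing or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    output = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(output)


# Hand-built sample using values from the log above (MiB, degrees C, watts).
sample = "NVIDIA GeForce RTX 3080, 550.54.14, 10240, 0, 36, 76.30\n"
gpus = parse_gpu_csv(sample)
print(gpus[0]["name"], gpus[0]["memory.total"], "MiB")
```

Calling query_gpus() on the machine itself should return one dictionary per attached GPU. The parser is kept separate so it can be tried on saved text without a GPU present.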
Loading the facebook/galactica-125m model succeeds, and running it in instruct mode, while not giving great results, at least succeeds now with a GPU.
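For reference, here is a minimal sketch of loading this model onto the GPU with the Hugging Face transformers library, assuming torch and transformers are installed. It is not the exact instruct-mode run described above: the prompt is a hypothetical placeholder and the helper names are illustrative.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch not installed; nothing to put on a GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


def generate(prompt, model_name="facebook/galactica-125m", max_new_tokens=40):
    """Load the model, move it to the chosen device, and complete the prompt."""
    # Imported here so pick_device() stays usable without transformers installed.
    from transformers import AutoTokenizer, OPTForCausalLM

    device = pick_device()
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = OPTForCausalLM.from_pretrained(model_name).to(device)

    # Pass only input_ids to generate(), moved to the same device as the model.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])


if __name__ == "__main__":
    print("device:", pick_device())
    # Hypothetical prompt; uncomment to download the model and run it.
    # print(generate("The Transformer architecture is"))
```

If pick_device() prints "cpu" on this machine even though nvidia-smi reports the card, the installed PyTorch build may lack CUDA support, which is a separate issue from the driver.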