Paul‐Pham‐Dev‐Diary‐2024‐04‐06 - TheEvergreenStateCollege/upper-division-cs-23-24 GitHub Wiki

6 April 2024 - Saturday

I picked up an 850 W power supply from Best Buy for about $96 because online reviews I read said that a system with the NVIDIA RTX 3080 might require up to 750 W at peak consumption during machine learning training tasks.

I powered off the machine to install the GPU into the PCIe slot. I couldn't find a separate space to mount the Corsair supply, since the existing no-name brand power supply is already taking up the usual space. So it's just hanging out on the carpet next to the open ATX case, which is dodgy long-term but okay for now as long as I move carefully around it and don't let cats or balloons into the office.

An anti-static mat, or cannibalizing an empty Dell ATX case from the Physical Computing Lab, would work better for a later step when we productionize this.

I followed this article on installing the latest NVIDIA drivers on Ubuntu.

Here is the command history showing how I detected which NVIDIA device was available, and which drivers and utilities I installed.

nvidia-detector
sudo nvidia-detector
dmesg | grep NVIDIA
sudo dmesg | grep NVIDIA
sudo dmesg | grep nvidia
cat /proc/driver/nvidia/version
sudo ubuntu-drivers list --gpgpu
sudo ubuntu-drivers install --gpgpu
man ascii
sudo ubuntu-drivers install --gpgpu
sudo nvidia-detector
sudo dmesg
sudo dmesg | grep -i nvidia
sudo reboot
sudo nvidia-detector
sudo ubuntu-drivers list --gpgpu
sudo ubuntu-drivers install --gpgpu nvidia:550
sudo ubuntu-drivers install --gpgpu nvidia:550-server
sudo ubuntu-drivers install nvidia-utils-550-server
sudo apt install nvidia-utils-550-server
history | less

The detector utility told me which version of the hardware driver I needed (I think).

$ sudo nvidia-detector
nvidia-driver-550

After installing the drivers, I got kernel messages making it seem like the hardware was detected.

$ sudo dmesg | grep -i "nvidia"
[    1.330136] nouveau 0000:01:00.0: NVIDIA GA102 (b72000a1)
[    4.533535] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input11
[    4.533560] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input12
[    4.533583] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input13
[    4.533601] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input14
[    4.533664] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input15
[    4.533687] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input16
[    4.533704] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card1/input17
[    4.735486] audit: type=1400 audit(1712441626.747:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1081 comm="apparmor_parser"
[    4.735506] audit: type=1400 audit(1712441626.747:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1081 comm="apparmor_parser"
[   15.163709] audit: type=1400 audit(1712441637.175:137): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="nvidia_modprobe" pid=1932 comm="apparmor_parser"

However, nvidia-smi, a utility I've used in the past to see which GPUs are visible to, e.g., PyTorch and general-purpose compute, shows nothing.

$ sudo nvidia-smi -a
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
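
One way to narrow down this kind of failure is to check which GPU kernel module is actually loaded. The dmesg output above shows nouveau reporting the GA102 device, which suggests (but does not prove) that the proprietary nvidia module was not yet active. The following is a minimal sketch, assuming a Linux system where /proc/modules is readable; the function names `loaded_gpu_modules` and `diagnose` are made up for illustration.

```python
from pathlib import Path


def loaded_gpu_modules(proc_modules_text: str) -> set[str]:
    """Return which known GPU kernel modules appear in /proc/modules text."""
    known = {"nvidia", "nouveau"}
    # The first whitespace-separated field on each line is the module name.
    names = {line.split()[0] for line in proc_modules_text.splitlines() if line.strip()}
    return known & names


def diagnose(modules: set[str]) -> str:
    """Summarize the likely driver state from the set of loaded modules."""
    if "nvidia" in modules:
        return "nvidia module loaded"
    if "nouveau" in modules:
        return "nouveau loaded instead of nvidia"
    return "no GPU module loaded"


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        print(diagnose(loaded_gpu_modules(proc_modules.read_text())))
    else:
        print("/proc/modules not available on this system")
```

If this reports nouveau rather than nvidia, a reboot after installing the driver packages is a reasonable next step, since the new module is typically only picked up at boot.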

Time to reboot.

GPU Detected

After rebooting, the previous nvidia-drivers command was no longer available, so I manually installed what seemed to be the correct driver from the apt package repositories:

sudo apt install nvidia-headless-550-server

And then at last nvidia-smi gives us some satisfaction:


==============NVSMI LOG==============

Timestamp                                 : Sat Apr  6 18:17:21 2024
Driver Version                            : 550.54.14
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3080
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-76f7cb5e-d58b-4a1e-4daf-5acd40b5431e
    Minor Number                          : 0
    VBIOS Version                         : 94.02.71.80.C3
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2216-202-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x221610DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0xC8901028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 30 %
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 10240 MiB
        Reserved                          : 237 MiB
        Used                              : 0 MiB
        Free                              : 10002 MiB
    BAR1 Memory Usage
        Total                             : 16384 MiB
        Used                              : 1 MiB
        Free                              : 16383 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable Parity     : N/A
            SRAM Uncorrectable SEC-DED    : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
            SRAM Threshold Exceeded       : N/A
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : N/A
            SRAM SM                       : N/A
            SRAM Microcontroller          : N/A
            SRAM PCIE                     : N/A
            SRAM Other                    : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 36 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 93 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 76.30 W
        Current Power Limit               : 320.00 W
        Requested Power Limit             : 320.00 W
        Default Power Limit               : 320.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 320.00 W
    GPU Memory Power Readings 
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1710 MHz
        SM                                : 1710 MHz
        Memory                            : 9501 MHz
        Video                             : 1500 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 9501 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 856.250 mV
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes                             : None
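
The full log is long, and only a few fields matter for a quick sanity check (product name, driver version, total memory). The following is a rough sketch of flattening `nvidia-smi -a` style output into a dictionary keyed by section path. It assumes the indentation-based layout shown above; `parse_nvsmi` is a hypothetical helper, not part of any NVIDIA tooling.

```python
import shutil
import subprocess


def parse_nvsmi(text: str) -> dict[str, str]:
    """Flatten indented 'Key : Value' output into {"Section/Key": "value"}."""
    result: dict[str, str] = {}
    stack: list[tuple[int, str]] = []  # (indent, section name) of enclosing headers
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("="):
            continue  # skip blank lines and the banner line
        indent = len(raw) - len(raw.lstrip())
        # Leave any sections at the same or deeper indentation than this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = raw.strip().partition(" : ")
        if sep:
            path = "/".join([name for _, name in stack] + [key.strip()])
            result[path] = value.strip()
        else:
            # A line without " : " is treated as a section header.
            stack.append((indent, raw.strip()))
    return result


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        proc = subprocess.run(["nvidia-smi", "-a"], capture_output=True, text=True)
        wanted = ("Driver Version", "Product Name", "FB Memory Usage/Total")
        for key, value in parse_nvsmi(proc.stdout).items():
            if key.endswith(wanted):
                print(f"{key}: {value}")
    else:
        print("nvidia-smi not found on PATH")
```

For a one-off check, nvidia-smi's own `--query-gpu` and `--format=csv` options are likely simpler; the parser is mainly useful when the full log has already been saved to a file.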

Loading the facebook/galactica-125m model succeeds, and running it in instruct mode now works on the GPU, even if the results aren't great.
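
For reference, a minimal sketch of loading this model onto the GPU with Hugging Face Transformers follows. It assumes the `torch` and `transformers` packages are installed and that the model weights can be downloaded. It is not the exact script used in this entry, and the prompt in the final comment is only an illustrative placeholder.

```python
def pick_device(cuda_available: bool) -> str:
    """Prefer the GPU when PyTorch reports one, otherwise fall back to CPU."""
    return "cuda" if cuda_available else "cpu"


def generate(prompt: str, model_name: str = "facebook/galactica-125m",
             max_new_tokens: int = 40) -> str:
    """Load the model, move it to the chosen device, and generate a completion."""
    # Imported lazily so pick_device can be used without these packages installed.
    import torch
    from transformers import AutoTokenizer, OPTForCausalLM

    device = pick_device(torch.cuda.is_available())
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = OPTForCausalLM.from_pretrained(model_name).to(device)

    # Only input_ids are passed to generate, moved to the same device as the model.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])


# Example call (downloads the model on first use):
# print(generate("Question: What is a GPU?\n\nAnswer:"))
```

While a call like this runs, the Processes and Utilization fields in nvidia-smi should show activity if the model really is on the GPU.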
