GPU Profiling - ProkopHapala/FireCore GitHub Wiki
CUDA / NVIDIA Nsight
- I made a CUDA reimplementation of the main kernels for the MMFF simulation in the branch prokop
- /cl/relax_multi.cu is the CUDA code
- cuMMFF_lib.cpp is a simple shared library which exposes the CUDA kernels as extern "C" function calls and manages the CUDA state (create/upload/download buffers, ...)
- cuMMFF.py is the Python ctypes binding to cuMMFF_lib.cpp
- test_mmff_cuda_vs_ocl.py is a test script which can run both the CUDA and the pyOpenCL implementation of MMFF
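The ctypes binding pattern used by cuMMFF.py can be sketched with a minimal self-contained example. Since the exact symbols of cuMMFF_lib are not listed here, the sketch loads the standard C math library instead; only the pattern (load shared object, declare signatures, call) carries over:

```python
import ctypes
import ctypes.util

# cuMMFF.py loads the compiled cuMMFF_lib shared object in the same way;
# here we use the standard C math library so the sketch runs anywhere.
lib = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare argument/return types before calling, exactly as one must do
# for the extern "C" entry points of a CUDA wrapper library.
lib.cos.restype = ctypes.c_double
lib.cos.argtypes = [ctypes.c_double]

print(lib.cos(0.0))  # 1.0
```

The same pattern extends to functions taking pointers: a NumPy array's buffer can be passed via `arr.ctypes.data_as(ctypes.POINTER(ctypes.c_float))`.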
- How to run NVIDIA Nsight on test_mmff_cuda_vs_ocl.py
  - see the script run_nvsight.sh
  - I just followed instructions from Perplexity on how to use it
  - To be honest, I tested it on my very old and weak laptop GPU (GTX 1650), which lacks a lot of features. On our better GPUs there should be many more metrics available.
  - I can run the visual Nsight application, but I did not spend any time figuring out how to use it.
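The contents of run_nvsight.sh are not reproduced here, but an Nsight Compute CLI (`ncu`) invocation that produces per-kernel metric tables like the ones below would look roughly like this; the metric names are taken from the output, while the exact flags of the actual script are an assumption:

```shell
# Sketch of an ncu run over the test script (requires an NVIDIA GPU and
# Nsight Compute installed); the real run_nvsight.sh may differ in details.
ncu --metrics dram__bytes_read,dram__bytes_write,\
sm__cycles_active,sm__sass_thread_inst_executed_op_fp32_pred_on,\
l1tex__t_sectors_pipe_lsu_mem_global_op_ld,l1tex__t_sectors_pipe_lsu_mem_global_op_st \
    python3 test_mmff_cuda_vs_ocl.py
```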
The output from run_nvsight.sh looks like this:
```
################# RUN CUDA MMFF #################
CUDA MD time: 1.5100074450001557
CU_MM object created (Buffer Map version).
CU_MM::init(nSys=500, nAtoms=50, nNode=16, npbc=1, nMaxSysN=4)
CU_MM::init() finished successfully (Buffer Map version).
ff.synchronize()
cuMMFF_lib.cpp run_MD(): nstep 10
==PROF== Disconnected from process 8274
CU_MM::cleanup() freeing 24 GPU buffers...
CU_MM::cleanup() finished.
CU_MM object destroyed (Buffer Map version).
[8274] [email protected]
getMMFFf4(int4, float4 *, float4 *, float4 *, int4 *, int4 *, float4 *, float4 *, float4 *, float4 *, float4 *, float4 *, cu_Mat3 *, cu_Mat3 *, float4 *, int, int) (1, 500, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  Section: Command line profiler metrics
  ------------------------------------------------- ----------- ------------
  Metric Name                                       Metric Unit Metric Value
  ------------------------------------------------- ----------- ------------
  dram__bytes_read.avg                                    Mbyte         3.58
  dram__bytes_read.max                                    Mbyte         3.63
  dram__bytes_read.min                                    Mbyte         3.51
  dram__bytes_read.sum                                    Mbyte        14.33
  dram__bytes_write.avg                                   Mbyte         7.79
  dram__bytes_write.max                                   Mbyte         7.97
  dram__bytes_write.min                                   Mbyte         7.64
  dram__bytes_write.sum                                   Mbyte        31.15
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.avg         sector     9,377.21
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.max         sector       11,291
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.min         sector        8,654
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum         sector      131,281
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.avg         sector    19,428.57
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.max         sector       23,392
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.min         sector       17,763
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum         sector      272,000
  sm__cycles_active.avg                                   cycle 2,354,948.86
  sm__cycles_active.max                                   cycle    2,416,241
  sm__cycles_active.min                                   cycle    2,333,394
  sm__cycles_active.sum                                   cycle   32,969,284
  sm__sass_thread_inst_executed_op_fp32_pred_on.avg        inst   353,035.71
  sm__sass_thread_inst_executed_op_fp32_pred_on.max        inst      415,170
  sm__sass_thread_inst_executed_op_fp32_pred_on.min        inst      316,320
  sm__sass_thread_inst_executed_op_fp32_pred_on.sum        inst    4,942,500
  ------------------------------------------------- ----------- ------------

updateAtomsMMFFf4(int4, float4 *, float4 *, float4 *, float4 *, float4 *, int4 *, float4 *, float4 *, float4 *, float4 *, cu_Mat3 *, int *, float4 *) (3, 500, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  Section: Command line profiler metrics
  ------------------------------------------------- ----------- ------------
  Metric Name                                       Metric Unit Metric Value
  ------------------------------------------------- ----------- ------------
  dram__bytes_read.avg                                    Mbyte         1.71
  dram__bytes_read.max                                    Mbyte         1.82
  dram__bytes_read.min                                    Mbyte         1.65
  dram__bytes_read.sum                                    Mbyte         6.84
  dram__bytes_write.avg                                   Mbyte         8.06
  dram__bytes_write.max                                   Mbyte         8.29
  dram__bytes_write.min                                   Mbyte         7.79
  dram__bytes_write.sum                                   Mbyte        32.25
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.avg         sector    12,505.86
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.max         sector       14,045
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.min         sector       10,745
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum         sector      175,082
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.avg         sector     7,071.43
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.max         sector        7,668
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.min         sector        6,443
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum         sector       99,000
  sm__cycles_active.avg                                   cycle 1,335,439.86
  sm__cycles_active.max                                   cycle    1,363,142
  sm__cycles_active.min                                   cycle    1,299,438
  sm__cycles_active.sum                                   cycle   18,696,158
  sm__sass_thread_inst_executed_op_fp32_pred_on.avg        inst   301,833.14
  sm__sass_thread_inst_executed_op_fp32_pred_on.max        inst      320,732
  sm__sass_thread_inst_executed_op_fp32_pred_on.min        inst      281,800
  sm__sass_thread_inst_executed_op_fp32_pred_on.sum        inst    4,225,664
  ------------------------------------------------- ----------- ------------
```
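One quick way to read these tables is to relate the executed FP32 instructions to DRAM traffic, which gives a crude arithmetic-intensity estimate. Using the sums from the getMMFFf4 table above, and the simplifying assumption that one predicated-on FP32 instruction is roughly one FLOP:

```python
# Crude arithmetic-intensity estimate from the getMMFFf4 metrics above.
fp32_inst  = 4_942_500               # sm__sass_thread_inst_executed_op_fp32_pred_on.sum
dram_bytes = (14.33 + 31.15) * 1e6   # dram__bytes_read.sum + dram__bytes_write.sum [bytes]

intensity = fp32_inst / dram_bytes   # FP32 instructions per DRAM byte
print(f"arithmetic intensity ~ {intensity:.2f} inst/byte")  # ~0.11 inst/byte
```

At roughly 0.11 FP32 instructions per DRAM byte, the kernel sits far below the compute/memory balance point of typical GPUs, which suggests that reducing global memory traffic (better coalescing, reuse through shared memory) would be the first optimization target.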
pyOpenCL MMFF
As a side product of checking whether the CUDA implementation reproduces the OpenCL results, I also made a simple pyOpenCL implementation of MMFF, which is more convenient to work with and much simpler than our C++ version in OCL_MM.h and MolWorld_sp3_multi.h.
- It is in MolecularDynamics.py, calling kernels from relax_multi_mini.cl.
- I also created a general Python class OpenCLBase.py which can automatically create Python interfaces to kernels by parsing the '.cl' source code. This can be useful in the future for quick, effortless prototyping of other pyOpenCL apps (much simpler than in C/C++).
- To initialize the system from an .xyz file we use OCL/MMFF.py, which reads the AtomicSystem object generated by AtomicSystem.py. It can build the bond topology, neighbors, electron pairs and pi-orbitals similarly to MMFFBuilder.h, but fully in Python (no need to compile any C/C++), which is more convenient for quick development and debugging. The types are loaded from the same data files AtomTypes.dat and ElementTypes.dat using MMparams.py.
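The idea behind generating kernel interfaces by parsing the '.cl' source can be illustrated with a simplified sketch. The kernel signatures and the regex below are illustrative only; the actual OpenCLBase.py is more complete:

```python
import re

# Toy OpenCL source with signatures shaped like those in relax_multi_mini.cl
# (argument lists here are made up for illustration).
CL_SRC = """
__kernel void getMMFFf4( __global float4* apos, __global float4* fapos, int n ){ }
__kernel void updateAtomsMMFFf4( __global float4* apos, float dt ){ }
"""

def parse_kernels(src):
    """Map each kernel name to the list of its argument names."""
    kernels = {}
    for m in re.finditer(r"__kernel\s+void\s+(\w+)\s*\(([^)]*)\)", src):
        name, args = m.group(1), m.group(2)
        # Last whitespace-separated token of each declaration is the arg name;
        # strip a leading '*' in case the pointer star sticks to the name.
        arg_names = [a.strip().split()[-1].lstrip("*")
                     for a in args.split(",") if a.strip()]
        kernels[name] = arg_names
    return kernels

print(parse_kernels(CL_SRC))
```

From such a mapping one can generate a Python method per kernel that looks up buffers by argument name and forwards them to the pyOpenCL kernel call, which is what makes prototyping new kernels nearly effortless.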