GPU Profiling - ProkopHapala/FireCore GitHub Wiki
CUDA / NVIDIA Nsight
- I made a CUDA reimplementation of the main kernels for the MMFF simulation in the branch prokop
- /cl/relax_multi.cu is the CUDA code
- cuMMFF_lib.cpp is a simple shared library which exposes the CUDA kernels as extern "C" function calls and manages the CUDA state (create/upload/download buffers, ...)
- cuMMFF.py is the Python ctypes binding to cuMMFF_lib.cpp
- test_mmff_cuda_vs_ocl.py is a test script which can run both the CUDA and the pyOpenCL implementation of MMFF
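The ctypes binding pattern used by cuMMFF.py can be sketched with a minimal self-contained example. Since the exact symbols of cuMMFF_lib are not listed here, the sketch loads the standard C math library instead; only the pattern (load shared object, declare signatures, call) carries over:

```python
import ctypes
import ctypes.util

# cuMMFF.py loads the compiled cuMMFF_lib shared object in the same way;
# here we use the standard C math library so the sketch runs anywhere.
lib = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare argument/return types before calling, exactly as one must do
# for the extern "C" entry points of a CUDA wrapper library.
lib.cos.restype = ctypes.c_double
lib.cos.argtypes = [ctypes.c_double]

print(lib.cos(0.0))  # 1.0
```

The same pattern extends to functions taking pointers: a NumPy array's buffer can be passed via `arr.ctypes.data_as(ctypes.POINTER(ctypes.c_float))`.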
- How to run NVIDIA Nsight on test_mmff_cuda_vs_ocl.py
  - see the script run_nvsight.sh
  - I just followed instructions from Perplexity on how to use it
  - To be honest, I tested it on my very old and weak laptop GPU (GTX 1650), which lacks a lot of features. On our better GPUs there should be many more metrics available.
  - I can run the visual Nsight application, but I did not spend any time figuring out how to use it.
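The contents of run_nvsight.sh are not reproduced here, but an Nsight Compute CLI (`ncu`) invocation that produces per-kernel metric tables like the ones below would look roughly like this; the metric names are taken from the output, while the exact flags of the actual script are an assumption:

```shell
# Sketch of an ncu run over the test script (requires an NVIDIA GPU and
# Nsight Compute installed); the real run_nvsight.sh may differ in details.
ncu --metrics dram__bytes_read,dram__bytes_write,\
sm__cycles_active,sm__sass_thread_inst_executed_op_fp32_pred_on,\
l1tex__t_sectors_pipe_lsu_mem_global_op_ld,l1tex__t_sectors_pipe_lsu_mem_global_op_st \
    python3 test_mmff_cuda_vs_ocl.py
```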
The output from run_nvsight.sh looks like this:
```
################# RUN CUDA MMFF #################
CUDA MD time: 1.5100074450001557
CU_MM object created (Buffer Map version).
CU_MM::init(nSys=500, nAtoms=50, nNode=16, npbc=1, nMaxSysN=4)
CU_MM::init() finished successfully (Buffer Map version).
ff.synchronize()
cuMMFF_lib.cpp run_MD(): nstep 10
==PROF== Disconnected from process 8274
CU_MM::cleanup() freeing 24 GPU buffers...
CU_MM::cleanup() finished.
CU_MM object destroyed (Buffer Map version).
[8274] [email protected]
getMMFFf4(int4, float4 *, float4 *, float4 *, int4 *, int4 *, float4 *, float4 *, float4 *, float4 *, float4 *, float4 *, cu_Mat3 *, cu_Mat3 *, float4 *, int, int) (1, 500, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  Section: Command line profiler metrics
  ------------------------------------------------- ----------- ------------
  Metric Name                                       Metric Unit Metric Value
  ------------------------------------------------- ----------- ------------
  dram__bytes_read.avg                                    Mbyte         3.58
  dram__bytes_read.max                                    Mbyte         3.63
  dram__bytes_read.min                                    Mbyte         3.51
  dram__bytes_read.sum                                    Mbyte        14.33
  dram__bytes_write.avg                                   Mbyte         7.79
  dram__bytes_write.max                                   Mbyte         7.97
  dram__bytes_write.min                                   Mbyte         7.64
  dram__bytes_write.sum                                   Mbyte        31.15
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.avg         sector     9,377.21
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.max         sector       11,291
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.min         sector        8,654
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum         sector      131,281
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.avg         sector    19,428.57
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.max         sector       23,392
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.min         sector       17,763
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum         sector      272,000
  sm__cycles_active.avg                                   cycle 2,354,948.86
  sm__cycles_active.max                                   cycle    2,416,241
  sm__cycles_active.min                                   cycle    2,333,394
  sm__cycles_active.sum                                   cycle   32,969,284
  sm__sass_thread_inst_executed_op_fp32_pred_on.avg        inst   353,035.71
  sm__sass_thread_inst_executed_op_fp32_pred_on.max        inst      415,170
  sm__sass_thread_inst_executed_op_fp32_pred_on.min        inst      316,320
  sm__sass_thread_inst_executed_op_fp32_pred_on.sum        inst    4,942,500
  ------------------------------------------------- ----------- ------------

updateAtomsMMFFf4(int4, float4 *, float4 *, float4 *, float4 *, float4 *, int4 *, float4 *, float4 *, float4 *, float4 *, cu_Mat3 *, int *, float4 *) (3, 500, 1)x(32, 1, 1), Context 1, Stream 7, Device 0, CC 7.5
  Section: Command line profiler metrics
  ------------------------------------------------- ----------- ------------
  Metric Name                                       Metric Unit Metric Value
  ------------------------------------------------- ----------- ------------
  dram__bytes_read.avg                                    Mbyte         1.71
  dram__bytes_read.max                                    Mbyte         1.82
  dram__bytes_read.min                                    Mbyte         1.65
  dram__bytes_read.sum                                    Mbyte         6.84
  dram__bytes_write.avg                                   Mbyte         8.06
  dram__bytes_write.max                                   Mbyte         8.29
  dram__bytes_write.min                                   Mbyte         7.79
  dram__bytes_write.sum                                   Mbyte        32.25
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.avg         sector    12,505.86
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.max         sector       14,045
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.min         sector       10,745
  l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum         sector      175,082
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.avg         sector     7,071.43
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.max         sector        7,668
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.min         sector        6,443
  l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum         sector       99,000
  sm__cycles_active.avg                                   cycle 1,335,439.86
  sm__cycles_active.max                                   cycle    1,363,142
  sm__cycles_active.min                                   cycle    1,299,438
  sm__cycles_active.sum                                   cycle   18,696,158
  sm__sass_thread_inst_executed_op_fp32_pred_on.avg        inst   301,833.14
  sm__sass_thread_inst_executed_op_fp32_pred_on.max        inst      320,732
  sm__sass_thread_inst_executed_op_fp32_pred_on.min        inst      281,800
  sm__sass_thread_inst_executed_op_fp32_pred_on.sum        inst    4,225,664
  ------------------------------------------------- ----------- ------------
```
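One quick way to read these tables is to relate the executed FP32 instructions to DRAM traffic, which gives a crude arithmetic-intensity estimate. Using the sums from the getMMFFf4 table above, and the simplifying assumption that one predicated-on FP32 instruction is roughly one FLOP:

```python
# Crude arithmetic-intensity estimate from the getMMFFf4 metrics above.
fp32_inst  = 4_942_500               # sm__sass_thread_inst_executed_op_fp32_pred_on.sum
dram_bytes = (14.33 + 31.15) * 1e6   # dram__bytes_read.sum + dram__bytes_write.sum [bytes]

intensity = fp32_inst / dram_bytes   # FP32 instructions per DRAM byte
print(f"arithmetic intensity ~ {intensity:.2f} inst/byte")  # ~0.11 inst/byte
```

At roughly 0.11 FP32 instructions per DRAM byte, the kernel sits far below the compute/memory balance point of typical GPUs, which suggests that reducing global memory traffic (better coalescing, reuse through shared memory) would be the first optimization target.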
pyOpenCL MMFF
As a side product of checking whether the CUDA implementation reproduces the OpenCL results, I also made a simple pyOpenCL implementation of MMFF, which is more convenient to work with and much simpler than our C++ version in OCL_MM.h and MolWorld_sp3_multi.h.
- It is in MolecularDynamics.py, calling kernels from relax_multi_mini.cl.
- I also created a general Python class OpenCLBase.py which can automatically create Python interfaces to kernels by parsing the '.cl' source code. This can be useful in the future for quick, effortless prototyping of other pyOpenCL apps (much simpler than in C/C++).
- To initialize the system from an .xyz file we use OCL/MMFF.py, which reads the AtomicSystem object generated by AtomicSystem.py. It can build the bond topology, neighbors, electron pairs and pi-orbitals similarly to MMFFBuilder.h, but fully in Python (no need to compile any C/C++), which is more convenient for quick development and debugging. The types are loaded from the same data files AtomTypes.dat and ElementTypes.dat using MMparams.py.
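The idea behind generating kernel interfaces by parsing the '.cl' source can be illustrated with a simplified sketch. The kernel signatures and the regex below are illustrative only; the actual OpenCLBase.py is more complete:

```python
import re

# Toy OpenCL source with signatures shaped like those in relax_multi_mini.cl
# (argument lists here are made up for illustration).
CL_SRC = """
__kernel void getMMFFf4( __global float4* apos, __global float4* fapos, int n ){ }
__kernel void updateAtomsMMFFf4( __global float4* apos, float dt ){ }
"""

def parse_kernels(src):
    """Map each kernel name to the list of its argument names."""
    kernels = {}
    for m in re.finditer(r"__kernel\s+void\s+(\w+)\s*\(([^)]*)\)", src):
        name, args = m.group(1), m.group(2)
        # Last whitespace-separated token of each declaration is the arg name;
        # strip a leading '*' in case the pointer star sticks to the name.
        arg_names = [a.strip().split()[-1].lstrip("*")
                     for a in args.split(",") if a.strip()]
        kernels[name] = arg_names
    return kernels

print(parse_kernels(CL_SRC))
```

From such a mapping one can generate a Python method per kernel that looks up buffers by argument name and forwards them to the pyOpenCL kernel call, which is what makes prototyping new kernels nearly effortless.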