Performance tests - ProkopHapala/FireCore GitHub Wiki

Cost of functions on GPU

`__kernel scanNonBond2PBC`

scanNonBond2PBC() invR2           |   0.0314 [ns/op]  31.8045 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   2.1670 [s]
scanNonBond2PBC() R2gauss         |   0.0203 [ns/op]  49.1580 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   1.4020 [s]
scanNonBond2PBC() Morse_lin5      |   0.0371 [ns/op]  26.9293 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   2.5593 [s]
scanNonBond2PBC() Morse_lin9      |   0.0396 [ns/op]  25.2603 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   2.7284 [s]
scanNonBond2PBC() Morse_lin17     |   0.0485 [ns/op]  20.6346 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   3.3401 [s]
scanNonBond2PBC() Morse_cub5      |   0.0407 [ns/op]  24.5827 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   2.8036 [s]
scanNonBond2PBC() Morse           |   0.1559 [ns/op]   6.4149 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:  10.7439 [s]
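
The columns in these rows appear to be related by simple arithmetic: `ntot = np * na * nPBC`, `ns/op = time / ntot`, and `GOPS = 1 / (ns/op)`. The sketch below reproduces the `invR2` row under that assumption; the function name `throughput` is hypothetical and not part of FireCore.

```python
def throughput(n_points, n_atoms, n_images, time_s):
    """Return (ntot, ns per pair evaluation, giga-operations per second)."""
    ntot = n_points * n_atoms * n_images   # total pairwise evaluations
    ns_per_op = time_s * 1e9 / ntot        # wall time per evaluation in ns
    gops = 1.0 / ns_per_op                 # 1 op/ns equals 1 GOPS
    return ntot, ns_per_op, gops

# invR2 row: np=1000, na=1000, nPBC=68921 images, time=2.1670 s
ntot, ns_per_op, gops = throughput(1000, 1000, 68921, 2.1670)
print(f"ntot={ntot} ns/op={ns_per_op:.4f} GOPS={gops:.2f}")
```

The result differs from the table only in the last digits, because the printed time is rounded to four decimals.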

`__kernel scanNonBond2PBC_2`

scanNonBond2PBC() invR2           |   0.0451 [ns/op]  22.1882 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   3.1062 [s]
scanNonBond2PBC() R2gauss         |   0.0349 [ns/op]  28.6160 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   2.4085 [s]
scanNonBond2PBC() Morse_lin5      |   0.0495 [ns/op]  20.1843 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   3.4146 [s]
scanNonBond2PBC() Morse_lin9      |   0.0514 [ns/op]  19.4399 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   3.5453 [s]
scanNonBond2PBC() Morse_lin17     |   0.0598 [ns/op]  16.7264 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   4.1205 [s]
scanNonBond2PBC() Morse_cub5      |   0.0511 [ns/op]  19.5799 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:   3.5200 [s]
scanNonBond2PBC() Morse           |   0.1590 [ns/op]   6.2884 [GOPS] | ntot:  68921000000 np:   1000 na:   1000 nPBC( 68921,[20, 20, 20]) time:  10.9600 [s]

`__kernel scanNonBond2`

scanNonBond2() invR2           |   0.0028 [ns/op] 358.2453 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.2791 [s]
scanNonBond2() R2gauss         |   0.0021 [ns/op] 474.9835 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.2105 [s]
scanNonBond2() Morse_lin5      |   0.0032 [ns/op] 310.8248 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.3217 [s]
scanNonBond2() Morse_lin9      |   0.0036 [ns/op] 277.6633 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.3601 [s]
scanNonBond2() Morse_lin17     |   0.0037 [ns/op] 272.0255 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.3676 [s]
scanNonBond2() Morse_cub5      |   0.0039 [ns/op] 253.7923 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.3940 [s]
scanNonBond2() Morse           |   0.0065 [ns/op] 152.7715 [GOPS] | ntot: 100000000000 np: 100000 na: 1000000 time:   0.6546 [s]

scanNonBond2() invR2           |   0.0020 [ns/op] 494.0477 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   2.0241 [s]
scanNonBond2() R2gauss         |   0.0017 [ns/op] 602.1384 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   1.6607 [s]
scanNonBond2() Morse_lin5      |   0.0029 [ns/op] 350.3550 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   2.8542 [s]
scanNonBond2() Morse_lin9      |   0.0032 [ns/op] 310.5692 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   3.2199 [s]
scanNonBond2() Morse_lin17     |   0.0033 [ns/op] 304.6605 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   3.2823 [s]
scanNonBond2() Morse_cub5      |   0.0034 [ns/op] 292.7462 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   3.4159 [s]
scanNonBond2() Morse           |   0.0059 [ns/op] 170.7652 [GOPS] | ntot: 1000000000000 np: 1000000 na: 1000000 time:   5.8560 [s]
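
To compare the potentials directly, the times of the last block (np = na = 1000000) can be normalized to the `invR2` baseline. This is only a sketch of that arithmetic using the numbers printed above; the dictionary and variable names are illustrative.

```python
# Wall times in seconds from the np=1000000, na=1000000 block above
times_s = {
    "invR2":       2.0241,
    "R2gauss":     1.6607,
    "Morse_lin5":  2.8542,
    "Morse_lin9":  3.2199,
    "Morse_lin17": 3.2823,
    "Morse_cub5":  3.4159,
    "Morse":       5.8560,
}

baseline = times_s["invR2"]
# Relative cost: values above 1 are slower than invR2, below 1 are faster
relative_cost = {name: t / baseline for name, t in times_s.items()}

for name, ratio in relative_cost.items():
    print(f"{name:12s} {ratio:5.2f}x")
```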

CPU single-core `MMFFsp3_loc.h` (commit)

CPU: AMD Ryzen 7 5800X (8 cores, 16 threads), 2200/4850 MHz

Test 1: nHexadecan_dicarboxylic 50 atoms using MMFFsp3

command: ./MolGUIapp -x common_resources/nHexadecan_dicarboxylic -iParalel 0 -T 100 0.01 -verb 2 -perframe 2000

NOTES: run_no_omp() bPBC=0

bNonBondNeighs=0   22.612 us/iter    =  44224 iter/s
bNonBondNeighs=1   17.418 us/iter    =  57411 iter/s
no-NonBond          4.520 us/iter    = 221238 iter/s
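
The second column is the reciprocal of the first (`iter/s = 1e6 / (us/iter)`). As a rough estimate, and assuming the `no-NonBond` run differs from the others only by skipping the non-bonded interactions, the difference between the two timings gives the share of time spent on them. The names below are illustrative.

```python
def iters_per_second(us_per_iter):
    """Convert a per-iteration time in microseconds to iterations per second."""
    return 1e6 / us_per_iter

# Timings from Test 1 (MMFFsp3, 50 atoms, no PBC)
timings_us = {
    "bNonBondNeighs=0": 22.612,
    "bNonBondNeighs=1": 17.418,
    "no-NonBond":        4.520,
}

rates = {name: iters_per_second(t) for name, t in timings_us.items()}

# Estimated fraction of the iteration spent in non-bonded interactions
nonbond_fraction = 1.0 - timings_us["no-NonBond"] / timings_us["bNonBondNeighs=0"]

for name, rate in rates.items():
    print(f"{name:18s} {rate:9.0f} iter/s")
print(f"non-bonded share (bNonBondNeighs=0): {nonbond_fraction:.0%}")
```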

Test 1b: nHexadecan_dicarboxylic 50 atoms using UFF

command: ./$name -x common_resources/nHexadecan_dicarboxylic -uff -iParalel 0 -T 100 0.01 -verb 2 -perframe 2000

bNonBondNeighs=0   26.90 us/iter    =  37174 iter/s
bNonBondNeighs=1   21.88 us/iter    =  45703 iter/s
no-NonBond          7.60 us/iter    = 131578 iter/s

Test 2: polymer-2_new PBC 45 atoms

command: ./MolGUIapp -x common_resources/polymer-2_new -g common_resources/NaCl_1x1_L2 -iParalel 0 -T 100 0.01 -verb 2 -perframe 500

NOTES: run_no_omp() bPBC=1 nPBC{1,1,0}; i.e. 3x3 = 9 images
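
The image counts quoted on this page are consistent with each `nPBC` component `n` spanning the range `-n..n` along one lattice vector, so the number of images is the product of `2n+1` over the three directions. A minimal sketch under that assumption (the helper name is hypothetical):

```python
def n_pbc_images(npbc):
    """Number of periodic images for nPBC=(nx,ny,nz), assuming each axis spans -n..n."""
    count = 1
    for n in npbc:
        count *= 2 * n + 1
    return count

print(n_pbc_images((1, 1, 0)))     # 3x3 images used in this test
print(n_pbc_images((20, 20, 20)))  # image count in the scanNonBond2PBC benchmark
print(n_pbc_images((5, 5, 0)))     # image count for the NaCl substrate in the GPU test
```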

bNonBondNeighs=0                          132.52 us/iter  =   7546 iter/s
bNonBondNeighs=1                          101.50 us/iter  =   9852 iter/s
no-NonBond(no GridFF)                       5.25 us/iter  = 190476 iter/s 
no-NonBond(+GridFF/trilinear)               6.73 us/iter  = 148588 iter/s 
no-NonBond(+GridFF/trilinear)(no thermostat) 5.26 us/iter  = 190114 iter/s 
no-NonBond(+GridFF/tricubic)                9.58 us/iter  = 104384 iter/s

Test 2: polymer-2_new CPU GridFF::addForce() vs. GridFF::evalMorsePBC_sym()

MolGUIapp -x common_resources/polymer-2_new   -g common_resources/NaCl_1x1_L2   -Ftol 1e-12 -iParalel 0  -dt 0.05 -nogridff -perframe 100

MolWorld_sp3::run_no_omp(bGridFF=true )    236.08 [us/iter]  4.235k [iter/s]    190k [atoms/s]
MolWorld_sp3::run_no_omp(bGridFF=false)   1139.96 [us/iter]  877    [iter/s]     39k [atoms/s]

MolWorld_sp3::MDloop()  (bUFF=0,iParalel=0,bSurfAtoms=1,bGridFF=1,bPBC=1,bNonBonded=1bNonBondNeighs=0,dt=0.05,niter=100) time=23.6358[ms/100](236.083[us/iter] tick2second=2.62962e-10)
MolWorld_sp3::MDloop()  (bUFF=0,iParalel=0,bSurfAtoms=1,bGridFF=0,bPBC=1,bNonBonded=1bNonBondNeighs=0,dt=0.05,niter=100) time=113.861[ms/100](1139.96[us/iter] tick2second=2.62962e-10)

Test 2: polymer-2_new GPU GridFF() vs. getSurfMorse()

NVIDIA GeForce RTX 3090 24GB driver 535.161.08
NaCl substrate containing 10 atoms with 121 PBC images ( nPBC=(5,5,0) 1210 atoms total)
polymer-2_new 45 atoms
from 40 to 200 replicas in parallel
500 or 100 iterations per frame
iParalel=3, i.e. MolWorld_sp3_multi::run_ocl_opt()

command: MolGUIapp_multi -m 40 -x common_resources/polymer-2_new -g common_resources/NaCl_1x1_L2 -Ftol 1e-12 -perframe 500
command: MolGUIapp_multi -m 200 -x common_resources/polymer-2_new -g common_resources/NaCl_1x1_L2 -Ftol 1e-12 -perframe 100

Results

run_ocl_opt(bGridFF=true,nSys=40 ,perFrame=500)    86.595 [us/step]   462k [step/s]  20.78 mil. [atom/s]
run_ocl_opt(bGridFF=false,nSys=40 ,perFrame=500)  194.658 [us/step]   205k [step/s]   9.24 mil. [atom/s]
run_ocl_opt(bGridFF=true,nSys=200,perFrame=100)   114.278 [us/step]  1750k [step/s]  78.75 mil. [atom/s]
run_ocl_opt(bGridFF=false,nSys=200,perFrame=100)  249.81  [us/step]   800k [step/s]  36.02 mil. [atom/s]
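
The aggregate rates in these rows seem to count every replica: one GPU step advances all `nSys` systems, so `step/s = nSys / (time per step)`, and `atom/s` multiplies that by the 45 atoms of polymer-2_new (substrate atoms are apparently not counted). The sketch below reproduces the `atom/s` column under those assumptions; the function name is hypothetical.

```python
N_ATOMS = 45  # atoms in polymer-2_new, per replica

def aggregate_rates(us_per_step, n_sys, n_atoms=N_ATOMS):
    """Return (replica-steps per second, atom-updates per second)."""
    steps_per_s = n_sys / (us_per_step * 1e-6)  # all replicas advance in one step
    atoms_per_s = steps_per_s * n_atoms
    return steps_per_s, atoms_per_s

rows = [
    ("bGridFF=true,  nSys=40",   86.595,  40),
    ("bGridFF=false, nSys=40",  194.658,  40),
    ("bGridFF=true,  nSys=200", 114.278, 200),
    ("bGridFF=false, nSys=200", 249.81,  200),
]

for label, us, n_sys in rows:
    steps, atoms = aggregate_rates(us, n_sys)
    print(f"{label:26s} {steps / 1e3:8.1f}k step/s  {atoms / 1e6:6.2f} mil. atom/s")

# Speedup of GridFF over direct surface evaluation at nSys=200
speedup_200 = 249.81 / 114.278
print(f"GridFF speedup at nSys=200: {speedup_200:.2f}x")
```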

run_ocl_opt(nSys=40|iPara=3,bSurfAtoms=1,bGridFF=1) NOT CONVERGED in 50 steps, |F|(7.20113e-05)>1e-12 time 4.1973 [ms]( 83.946 [us/step]) bGridFF=1 iSysFMax=0 dovdW=1 
run_ocl_opt(nSys=40|iPara=3,bSurfAtoms=1,bGridFF=0) NOT CONVERGED in 50 steps, |F|(7.23475e-05)>1e-12 time 9.76089 [ms]( 195.218 [us/step]) bGridFF=0 iSysFMax=0 dovdW=1 
run_ocl_opt(nSys=40|iPara=3,bSurfAtoms=1,bGridFF=1) NOT CONVERGED in 500 steps, |F|(9.12548e-05)>1e-12 time 43.2977 [ms]( 86.5954 [us/step]) bGridFF=1 iSysFMax=0 dovdW=1 
run_ocl_opt(nSys=40|iPara=3,bSurfAtoms=1,bGridFF=0) NOT CONVERGED in 500 steps, |F|(7.3448e-05)>1e-12 time 97.3289 [ms]( 194.658 [us/step]) bGridFF=0 iSysFMax=1 dovdW=1 
run_ocl_opt(nSys=200|iPara=3,bSurfAtoms=1,bGridFF=1) NOT CONVERGED in 100 steps, |F|(7.39509e-05)>1e-12 time 11.4278 [ms]( 114.278 [us/step]) bGridFF=1 iSysFMax=0 dovdW=1
run_ocl_opt(nSys=200|iPara=3,bSurfAtoms=1,bGridFF=0) NOT CONVERGED in 100 steps, |F|(0.000113862)>1e-12 time 24.981 [ms]( 249.81 [us/step]) bGridFF=0 iSysFMax=0 dovdW=1
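
For collecting such timings automatically, the raw log lines above can be parsed with a regular expression. This is a minimal sketch that assumes the line layout shown here stays unchanged; the pattern and the function `parse_run_line` are illustrative and not part of FireCore.

```python
import re

# Matches the run_ocl_opt(...) log lines in the format printed above
PATTERN = re.compile(
    r"run_ocl_opt\(nSys=(?P<nsys>\d+)\|iPara=(?P<ipara>\d+),"
    r"bSurfAtoms=(?P<surf>\d),bGridFF=(?P<grid>\d)\)"
    r".*?in\s+(?P<steps>\d+)\s+steps"
    r".*?time\s+(?P<ms>[\d.]+)\s+\[ms\]\(\s*(?P<us>[\d.]+)\s+\[us/step\]\)"
)

def parse_run_line(line):
    """Extract nSys, bGridFF, step count and timings from one log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "nSys": int(m["nsys"]),
        "bGridFF": m["grid"] == "1",
        "steps": int(m["steps"]),
        "time_ms": float(m["ms"]),
        "us_per_step": float(m["us"]),
    }

line = ("run_ocl_opt(nSys=200|iPara=3,bSurfAtoms=1,bGridFF=1) NOT CONVERGED in 100 steps, "
        "|F|(7.39509e-05)>1e-12 time 11.4278 [ms]( 114.278 [us/step]) "
        "bGridFF=1 iSysFMax=0 dovdW=1")
rec = parse_run_line(line)
print(rec)

# Consistency check: total time divided by step count should give us/step
assert abs(rec["time_ms"] * 1e3 / rec["steps"] - rec["us_per_step"]) < 1e-3
```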