Lattice Optimization OCL OMP - ProkopHapala/FireCore GitHub Wiki

Computer in the office

  • Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz (11/12 cores used)
  • NVIDIA GeForce RTX 3060 OpenCL1.2 COMPUTE_UNITS:28 @1777MHz GLOBAL_MEM:12019 MB LOCAL_MEM:48kB

Interactive GUI:

pulling atoms, natoms 45 nnode 24 ncap 21 npi 24 nPBC{1,1,0}

Build-dbg

run_ocl_opt     (nSys=40|iPara=2)  NOT CONVERGED in 50/50 steps |F|(0.000400031)>1e-06 time=  4.8674[ms]    97.34 [us/step] bGridFF=1 iSysFMax=0 dovdW=1
run_omp_ocl     (nSys=40|iPara=1)  NOT CONVERGED in 50/50 steps |F|=6.10705e-05        time= 10.9407[ms]   218.81 [us/step]
run_omp_ocl     (nSys=40|iPara=0)  NOT CONVERGED in 50/50 steps |F|=0.0121325          time= 82.8864[ms]  1657.73 [us/step]
run_multi_serial(nSys=40|iPara=-1) NOT CONVERGED in 50/50 steps |F|=0.165467           time= 436.057[ms]  8721.15 [us/step]
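
The relative speed of the four solver modes can be read off these log lines directly. The following is a minimal sketch, not part of FireCore: the regular expression and the helper name `parse_timings` are assumptions made for illustration and are fitted only to the log format shown above.

```python
import re

# Debug-build log lines copied from above (trailing flags omitted).
LOG = """\
run_ocl_opt     (nSys=40|iPara=2)  NOT CONVERGED in 50/50 steps |F|(0.000400031)>1e-06 time=  4.8674[ms]    97.34 [us/step]
run_omp_ocl     (nSys=40|iPara=1)  NOT CONVERGED in 50/50 steps |F|=6.10705e-05        time= 10.9407[ms]   218.81 [us/step]
run_omp_ocl     (nSys=40|iPara=0)  NOT CONVERGED in 50/50 steps |F|=0.0121325          time= 82.8864[ms]  1657.73 [us/step]
run_multi_serial(nSys=40|iPara=-1) NOT CONVERGED in 50/50 steps |F|=0.165467           time= 436.057[ms]  8721.15 [us/step]
"""

# Hypothetical pattern: solver name, iPara value, total time in ms.
PATTERN = re.compile(r"^(\w+)\s*\(nSys=\d+\|iPara=(-?\d+)\).*?time=?\s*([\d.]+)\[ms\]")

def parse_timings(log):
    """Return {iPara: (solver, time_ms)} for every line that matches."""
    out = {}
    for line in log.splitlines():
        m = PATTERN.match(line)
        if m:
            out[int(m.group(2))] = (m.group(1), float(m.group(3)))
    return out

timings = parse_timings(LOG)
t_serial = timings[-1][1]  # iPara=-1 is the serial reference
speedup = {ip: t_serial / t for ip, (_, t) in timings.items()}

for ip in sorted(speedup):
    print(f"iPara={ip:2d} {timings[ip][0]:17s} {speedup[ip]:6.1f}x")
```

For the debug build this gives roughly 90x for run_ocl_opt, 40x for run_omp_ocl with iPara=1, and 5x for run_omp_ocl with iPara=0, all relative to run_multi_serial.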

Build-opt

solver                              natom  perFrame  nSys     [ms]  [us/step]  nstep/sec  nstep*nSys/s  nstep*nSys*natom/s
---------------------------------------------------------------------------------------------------------------------------
run_ocl_opt     (nSys=40|iPara=2)      45        50    40    4.848      96.97   10,312.5       412,499          18,562,442
run_omp_ocl     (nSys=40|iPara=1)      45        50    40    8.885     177.71    5,627.1       225,086          10,128,862
run_omp_ocl     (nSys=40|iPara=0)      45        50    40   83.129   1,662.59      601.5        24,057           1,082,648
run_multi_serial(nSys=40|iPara=-1)     45        50    40  438.562   8,771.24      114.0         4,560             205,216
run_ocl_opt     (nSys=40|iPara=2)  NOT CONVERGED in 50/50 steps, |F|(1.65009)>1e-06  time    4.848[ms]     96.97 [us/step] bGridFF=1 iSysFMax=0 dovdW=1
run_omp_ocl     (nSys=40|iPara=1)  NOT CONVERGED in 50/50 steps  |F|=0.424407        time=   8.885[ms]    177.71 [us/step]
run_omp_ocl     (nSys=40|iPara=0)  NOT CONVERGED in 50/50 nsteps |F|=0.241127        time=  83.129[ms]   1662.59 [us/step]
run_multi_serial(nSys=40|iPara=-1) NOT CONVERGED in 50/50 nsteps |F|=0.0544194       time= 438.562[ms]   8771.24 [us/step]
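
The rate columns of the table follow from the time per step alone. The sketch below reproduces them; the function name `throughput` is an assumption made for illustration. The last column of the table matches the product of steps per second, nSys, and natom.

```python
def throughput(us_per_step, n_sys, n_atom):
    """Derive the three rate columns from the time per step in microseconds."""
    steps_per_s = 1.0e6 / us_per_step
    sys_steps_per_s = steps_per_s * n_sys          # steps summed over all replicas
    atom_steps_per_s = sys_steps_per_s * n_atom    # atom updates per second
    return steps_per_s, sys_steps_per_s, atom_steps_per_s

# [us/step] values taken from the optimized-build table above
us_per_step = {
    "run_ocl_opt      iPara=2": 96.97,
    "run_omp_ocl      iPara=1": 177.71,
    "run_omp_ocl      iPara=0": 1662.59,
    "run_multi_serial iPara=-1": 8771.24,
}

rates = {name: throughput(t, n_sys=40, n_atom=45) for name, t in us_per_step.items()}

for name, (s, ss, sa) in rates.items():
    print(f"{name:26s} {s:10.1f} {ss:12.0f} {sa:14.0f}")
```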

Lattice Optimization

Laptop at home

  • Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz (7/8 cores used)
  • NVIDIA GeForce GTX 960M OpenCL1.2 COMPUTE_UNITS:5 @1176MHz GLOBAL_MEM:4046 MB LOCAL_MEM:48kB

Interactive GUI:

pulling atoms, natoms 45 nnode 24 ncap 21 npi 24 nPBC{1,1,0}

run_multi_serial(nSys=40|iPara=-1) NOT CONVERGED in 50/50 nsteps |F|=0.507869 time=367.992[ms] 7359.84[us/step]
run_omp_ocl(nSys=40|iPara=0) NOT CONVERGED in 50/50 nsteps |F|=0.109739 time=107.339[ms] 2146.78[us/step]
run_omp_ocl(nSys=40|iPara=1) NOT CONVERGED in 50/50 nsteps |F|=0.0744521 time=16.7628[ms] 335.256[us/step]
run_ocl_opt(nSys=40|iPara=2) NOT CONVERGED in 50 steps, |F|(0.0639756)>1e-06 time 14.3617[ms] 287.234[us/step] bGridFF=1 iSysFMax=0 dovdW=1
getBuffs(): nSys 40 nDOFs 207 nvecs 69  natoms 45 nnode 24 ncap 21 npi 24 nPBC{1,1,0}

mmff.run(10000,iParalel=-1)
run_multi_serial(nSys=40|iPara=2) CONVERGED in 1654/10000 nsteps |F|=0.000972228 time=31521.1[ms] 19057.5[us/step]
Py: time(optimizeLattice_1d) 34.6389[s]

mmff.run(10000,iParalel=0)
run_omp_ocl(nSys=40|iPara=2) CONVERGED in 1654/10000 nsteps |F|=0.000972228 time=10314.9[ms] 6236.34[us/step] 
Py: time(optimizeLattice_1d) 10.426[s]

mmff.run(10000,iParalel=1)
run_omp_ocl(nSys=40|iPara=2) CONVERGED in 1797/10000 nsteps |F|=0.000918641 time=1263.58[ms] 703.163[us/step]  
Py: time(optimizeLattice_1d) 1.1521[s]

mmff.run(10000,iParalel=2)
run_ocl_opt(nSys=40|iPara=2) CONVERGED in <3730 steps, |F|(0.000996397)<0.001 time 2714.56[ms] 727.763[us/step] bGridFF=1 
Py: time(optimizeLattice_1d) 4.76416[s]

run_ocl_opt(nSys=40|iPara=2) NOT CONVERGED in 50 steps, |F|(0.000474281)>1e-06 time 14.3493[ms] 286.986[us/step] bGridFF=1 iSysFMax=0 dovdW=1
getBuffs(): nSys 10 nDOFs 207 nvecs 69  natoms 45 nnode 24 ncap 21 npi 24 nPBC{1,1,0}

mmff.run(10000,iParalel=-1)
rum_multi_serial(bOcl=0) CONVERGED in 1654/10000 nsteps |F|=0.000972228 time=7654.02[ms]
Py: time(optimizeLattice_1d) 8.53864[s]


mmff.run(10000,iParalel=0)
rum_omp_ocl(bOcl=0) CONVERGED in 1654/10000 nsteps |F|=0.000972228 time=3878.08[ms]
Py: time(optimizeLattice_1d) 3.6377[s]

mmff.run(10000,iParalel=1)
rum_omp_ocl(bOcl=1) CONVERGED in 1797/10000 nsteps |F|=0.000918641 time=897.201[ms]
Py: time(optimizeLattice_1d) 0.811637[s]
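
To compare the two batches of converged runs, the wall time can be divided by the number of replica systems. This is a rough sketch under one assumption: that the nSys value printed by each getBuffs() line (40, then 10) applies to the mmff.run() calls listed after it. The dictionary layout is only illustrative.

```python
# Total wall time [ms] of each converged run, keyed by (iParalel, nSys).
# Assumption: the nSys printed by getBuffs() applies to the runs listed after it.
time_ms = {
    (-1, 40): 31521.1, (0, 40): 10314.9, (1, 40): 1263.58,
    (-1, 10): 7654.02, (0, 10): 3878.08, (1, 10): 897.201,
}

# Wall time per replica system [ms]
per_system = {key: t / key[1] for key, t in time_ms.items()}

for ipar in (-1, 0, 1):
    t40 = per_system[(ipar, 40)]
    t10 = per_system[(ipar, 10)]
    print(f"iParalel={ipar:2d}  nSys=40: {t40:7.1f} ms/system  "
          f"nSys=10: {t10:7.1f} ms/system  ratio {t10 / t40:4.2f}")
```

Under that assumption the serial cost per system is nearly independent of nSys, while the parallel modes are cheaper per system with 40 replicas than with 10 (by about 1.5x for iParalel=0 and about 2.8x for iParalel=1).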

Dependence on local_size in kernel getNonBond():

For the laptop GPU (NVIDIA GeForce GTX 960M), 50 iterations of MolWorld_sp3_multi::run_ocl_opt() with polymer-new.xyz (45 atoms):

nloc=1   119.535 [ms]
nloc=2    64.366 [ms]
nloc=4    38.498 [ms]
nloc=8    24.559 [ms]
nloc=16   15.665 [ms]
nloc=32   13.665 [ms]
nloc=64   13.800 [ms]
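
The scaling with local_size is easier to judge as a speedup relative to nloc=1. A minimal sketch using only the timings listed above:

```python
# Time for 50 iterations [ms] versus local work-group size, from the list above
times_ms = {1: 119.535, 2: 64.366, 4: 38.498, 8: 24.559,
            16: 15.665, 32: 13.665, 64: 13.800}

base = times_ms[1]
speedups = {nloc: base / t for nloc, t in times_ms.items()}

for nloc, t in times_ms.items():
    print(f"nloc={nloc:<3d} {t:8.3f} ms  speedup {speedups[nloc]:5.2f}x")

# Local size with the shortest measured time
best = min(times_ms, key=times_ms.get)
print("fastest nloc:", best)
```

The gain saturates at nloc=32, which is about 8.7x faster than nloc=1; nloc=64 is marginally slower. A local size of 32 equals the warp size of NVIDIA GPUs, which may explain the plateau, although these notes do not test that.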