CodeFest Jlab January 2018 - lattice/quda GitHub Wiki

  • Different precision for halo and body (colorspinor::FieldOrder)
  • Add support for 8-bit fixed point in QUDA by adding a new QUDA_QUARTER_PRECISION
  • 8-bit halos for smoother (combining above two)
  • Multi-right-hand-side MG setup for fine and coarse grids (bigger effect on coarse grids)
  • Add support for non-Hermitian chronological prediction
  • Investigate stability of chronological subspace evolution (over-refinement issues seen on pure gauge?)
  • Try CG for null-space finding?
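
To make the 8-bit fixed-point idea concrete, here is a minimal sketch of block quantization with a single float scale per block. It is not QUDA's implementation: the `QuarterBlock` struct and the `quantize` / `dequantize` functions are hypothetical names chosen for illustration, and the choice of one scale per block (the maximum absolute value) is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not a QUDA type): a block of reals stored as
// signed 8-bit fixed point plus one float scale for the whole block.
struct QuarterBlock {
  float scale;                 // max |x| over the block
  std::vector<std::int8_t> q;  // quantized values in [-127, 127]
};

// Map each value to the nearest of 255 levels spanning [-scale, scale].
QuarterBlock quantize(const std::vector<float> &x) {
  QuarterBlock out;
  out.scale = 0.0f;
  for (float v : x) out.scale = std::max(out.scale, std::fabs(v));
  out.q.resize(x.size());
  // Guard against an all-zero block, which would otherwise divide by zero.
  const float inv = out.scale > 0.0f ? 127.0f / out.scale : 0.0f;
  for (std::size_t i = 0; i < x.size(); i++)
    out.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
  return out;
}

// Reconstruct approximate floats from the 8-bit representation.
std::vector<float> dequantize(const QuarterBlock &b) {
  std::vector<float> x(b.q.size());
  const float s = b.scale / 127.0f;
  for (std::size_t i = 0; i < x.size(); i++) x[i] = s * b.q[i];
  return x;
}
```

With this scheme the round-trip error per element is at most half a quantization step, i.e. about `scale / 254`, which is why such a format is a more natural fit for halos and smoothers than for the outer solver.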

Memory reduction strategies:

  • Thrust memory allocations don't seem to be routed through QUDA's allocators
  • remove fp32 null-space temporary during prolongator construction
  • use same smoother for pre and post
  • can chrono vectors be in single precision?
  • run the GCR in half precision?
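
As a rough guide to what the precision changes above would save, the following sketch computes the storage for a set of fermion vectors. The function name `vector_set_bytes` is hypothetical, and the layout (every site holds `n_spin * n_color` complex numbers, with no padding or per-site norm arrays) is a simplifying assumption rather than QUDA's actual field layout.

```cpp
#include <cstddef>

// Bytes for n_vec vectors on `volume` sites, assuming each site stores
// n_spin * n_color complex numbers (2 reals each) and nothing else.
std::size_t vector_set_bytes(std::size_t n_vec, std::size_t volume,
                             std::size_t n_spin, std::size_t n_color,
                             std::size_t bytes_per_real) {
  return n_vec * volume * n_spin * n_color * 2 * bytes_per_real;
}
```

For example, under these assumptions `vector_set_bytes(8, 16 * 16 * 16 * 16, 4, 3, 8)` gives the footprint of eight double-precision vectors on a 16^4 local volume, and passing `4` instead of `8` for `bytes_per_real` halves it.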

The copy-gauge kernels were not using fine-grained parallelization and hence ran very slowly. E.g., on a 2^4 lattice with Nc=24, copy ghost took 4 ms per direction on a P100 vs. 10 us for the coarse dslash. With fine-grained parallelization applied, these kernels now run in 10-30 us, so the problem is fixed.
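
The reason fine-grained parallelization helps here can be seen by counting work items. The sketch below is illustrative only: `WorkItem`, `coarse_grained_items`, `fine_grained_items` and `decompose` are hypothetical names, and it assumes an Nc x Nc matrix per site, whereas QUDA's actual coarse-link dimensions and kernel indexing may differ.

```cpp
#include <cstddef>

// Hypothetical index triple handled by one thread in the fine-grained scheme.
struct WorkItem {
  std::size_t site, row, col;
};

// Site-level parallelism: one thread per lattice site.
std::size_t coarse_grained_items(std::size_t volume) { return volume; }

// Fine-grained parallelism: one thread per (site, row, col) matrix element.
std::size_t fine_grained_items(std::size_t volume, std::size_t nc) {
  return volume * nc * nc;
}

// Recover (site, row, col) from a flat thread index, with col running fastest.
WorkItem decompose(std::size_t idx, std::size_t nc) {
  WorkItem w;
  w.col = idx % nc;
  w.row = (idx / nc) % nc;
  w.site = idx / (nc * nc);
  return w;
}
```

On a 2^4 lattice, site-level parallelism exposes only 16 threads, far too few to occupy a P100, while under the Nc x Nc assumption with Nc=24 the fine-grained mapping exposes 16 * 24 * 24 = 9216 independent work items.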