CodeFest Jlab January 2018 - lattice/quda GitHub Wiki

Different precision for halo and body (colorspinor::FieldOrder)
Add support for 8-bit fixed point in QUDA by adding a new QUDA_QUARTER_PRECISION
8-bit halos for smoother (combining above two)
Multi-right-hand-side MG setup for fine and coarse grids (bigger effect on coarse grids)
Add support for non-Hermitian chronological prediction
Investigate stability of chronological subspace evolution (over-refinement issues seen on pure gauge?)
Try CG for null-space finding?
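
As a rough illustration of the 8-bit fixed-point idea above, here is a minimal sketch that stores a block of values as `int8_t` plus one shared float scale. The names `QuarterBlock`, `quantize` and `dequantize` are hypothetical and are not QUDA's API; the choice of one max-norm per block is also an assumption made for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (not a QUDA type): a block of values quantized to
// 8 bits, sharing a single float scale equal to the block's max |x|.
struct QuarterBlock {
  float norm;                     // max absolute value over the block
  std::vector<std::int8_t> data;  // values scaled into [-127, 127]
};

// Quantize: scale by 127 / max|x| and round to the nearest integer.
QuarterBlock quantize(const std::vector<float> &x) {
  float norm = 0.0f;
  for (float v : x) norm = std::max(norm, std::fabs(v));
  QuarterBlock q{norm, std::vector<std::int8_t>(x.size())};
  const float scale = norm > 0.0f ? 127.0f / norm : 0.0f;
  for (std::size_t i = 0; i < x.size(); i++)
    q.data[i] = static_cast<std::int8_t>(std::lround(x[i] * scale));
  return q;
}

// Dequantize: undo the scaling; the error per element is at most norm / 254.
std::vector<float> dequantize(const QuarterBlock &q) {
  std::vector<float> x(q.data.size());
  const float scale = q.norm / 127.0f;
  for (std::size_t i = 0; i < x.size(); i++) x[i] = q.data[i] * scale;
  return x;
}
```

With this layout each value costs one byte instead of four, which is where the appeal for halo exchange in a smoother would come from; whether the resulting accuracy is sufficient is exactly what the items above propose to test.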

Memory reduction strategies:

Thrust memory allocations don't seem to be routed through QUDA's allocators
Remove fp32 null-space temporary during prolongator construction
Use same smoother for pre and post
Can chrono vectors be in single precision?
Run the GCR in half precision?

Copy gauge and copy gauge-kernels were not using fine-grained parallelization and hence were running very slowly. E.g., for 2^4, Nc=24, copy ghost took 4 ms per direction on a P100 vs. 10 µs for the coarse dslash. With fine-grained parallelization applied, these kernels now run in 10-30 µs. Problem fixed!
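
To see why the granularity matters at this problem size, here is a small host-side sketch of the index arithmetic. It assumes that 2^4 is the local volume (so a ghost face has 2^3 = 8 sites) and that each site carries an Nc x Nc matrix; the names `WorkCount`, `count_work` and `decompose` are hypothetical and do not reflect how QUDA's kernels are actually written.

```cpp
#include <cstddef>

// Hypothetical sketch: number of independent work items a copy kernel exposes.
// Coarse-grained: one thread copies a whole Nc x Nc matrix for one site.
// Fine-grained:   one thread copies a single matrix element.
struct WorkCount {
  std::size_t coarse_grained;
  std::size_t fine_grained;
};

WorkCount count_work(std::size_t sites, std::size_t nc) {
  return {sites, sites * nc * nc};
}

// Decompose a flat fine-grained thread index into (site, row, col),
// with col running fastest.
struct ElemIndex {
  std::size_t site, row, col;
};

ElemIndex decompose(std::size_t tid, std::size_t nc) {
  ElemIndex e;
  e.col = tid % nc;
  e.row = (tid / nc) % nc;
  e.site = tid / (nc * nc);
  return e;
}
```

Under these assumptions, `count_work(8, 24)` gives 8 threads for the one-thread-per-site mapping versus 4608 for the one-thread-per-element mapping. Eight threads cannot come close to occupying a P100, which is consistent with the large gap in timings reported above.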