MDWF solver iteration count - callat-qcd/chroma GitHub Wiki

We observed that on the a15m135XL (or a12m130) ensemble, a newer version of Chroma took more solver iterations to reach the same solution than an older version of Chroma.

  1. We should check if the issue still persists.
  2. If it does, we should bisect ("bifurcate") the Chroma history and find the commit where it first appears.
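The search in step 2 can be sketched as a binary search over an ordered list of commits. This is only an illustration: the commit names are placeholders, `first_bad` and `is_bad` are hypothetical names, and in a real run `is_bad` would have to build that commit and rerun the solve. It also assumes the history flips from good to bad exactly once. In practice `git bisect` automates the same search.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, commits[0] is good,
    commits[-1] is bad, and the history flips from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Toy history: placeholder names, not real Chroma commits.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
regressed = {"c5", "c6", "c7"}
culprit = first_bad(history, lambda c: c in regressed)
print(culprit)
```

Each step halves the range, so a history of N commits needs about log2(N) builds and test runs.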

Chroma Quda Solution check

using the UNPRECONDITIONED_NEF

  • `/usr/workspace/coldqcd/software/lassen_smpi_RR2/install/chroma_callat_gpu/bin/chroma`
  • xml/48c64_weakfield.mat.ini.xml

Intermittently, the solution check reports that the source on the s=0 slice is zero while the source on the s=L5-1 slice is nonzero:

| r[0] | = 2.41868299013787e-05 | b[0] | = 0 |r|/|b|[0] = inf
...
| r[23] | = 0.00288527063196605 | b[23] | = 3.50324065066468
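To scan a log for slices with a zero source, the two quantities on each line can be parsed and divided directly. The sketch below assumes the log lines have exactly the layout shown above; the regular expression and the function name `relative_residuals` are illustrative and not part of Chroma.

```python
import math
import re

# Matches lines such as "| r[23] | = 0.0028... | b[23] | = 3.50..."
LINE = re.compile(
    r"\|\s*r\[(\d+)\]\s*\|\s*=\s*(\S+)\s*\|\s*b\[\d+\]\s*\|\s*=\s*(\S+)"
)


def relative_residuals(log_text):
    """Map each s index to |r|/|b|, using inf where |b| is zero."""
    out = {}
    for m in LINE.finditer(log_text):
        s, r, b = int(m.group(1)), float(m.group(2)), float(m.group(3))
        out[s] = r / b if b != 0.0 else math.inf
    return out


# The two lines quoted above, copied verbatim.
log = """| r[0] | = 2.41868299013787e-05 | b[0] | = 0 |r|/|b|[0] = inf
| r[23] | = 0.00288527063196605 | b[23] | = 3.50324065066468"""

ratios = relative_residuals(log)
for s, ratio in sorted(ratios.items()):
    print(s, ratio)
```

For the s=23 line this gives a ratio of roughly 8.2e-4, while s=0 is reported as inf because the source norm there is zero.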

3 March

Henry and I tried all 3 versions of the code (mdwf_lanczos with MAT solution, mdwf_production, and chroma), and with all three Chroma fails the solution check: the reported residual is O(0.5) instead of the requested 5e-4, using the test input file

/usr/workspace/coldqcd/c51/x_files/project_2/test_mdwf/solver_iter/xml/a15m135XL_prop.fast.new.ini.xml

The plan is to get an OLD Chroma

commit 825ba0360fe0c9f96903140d91df58859c78fe09
Author: Balint Joo
Date:   Wed Jan 3 10:06:06 2018 -0500

and find a QUDA version from that time, run the test, and then figure out when the test starts failing.

Lassen Installation

We can check on Lassen. Our software installation is here

/usr/workspace/coldqcd/software/callat_build_scripts

The routine that compiles our software stack is

build_lassen_smpi_RR2.sh

where one sees that stack_scripts/./build_chroma_quda.sh is the script that builds our Chroma. It uses the variable $chroma_src, which is defined in env.sh as chroma_src=chroma_callat. If we go to the source directory

/usr/workspace/coldqcd/software/src/chroma_callat

we see that we are on the mdwf_lanczos branch, which is branched off of mdwf_production.

What should we do?

  1. compile chroma_callat_ah from mdwf_production branch
  2. compile chroma_usqcd from the branch we think we forked from, 825ba0360fe0c9f96903140d91df58859c78fe09
  3. compile latest chroma

With all 3, run a single-source propagator solve on a12m130 (or a15m135XL), check that the solutions agree, and record how many iterations each build needs to get there.
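One way to make "the same solution" concrete is a relative norm difference between two solution vectors, compared against a tolerance of our choosing. The sketch below is a minimal illustration under that assumption: `rel_diff` is a hypothetical helper, the vectors are toy data rather than real propagators, and reading the actual propagator files is not covered.

```python
import math


def rel_diff(x, y):
    """Return ||x - y|| / ||y|| for two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("solutions must have the same length")
    num = math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))
    den = math.sqrt(sum(abs(b) ** 2 for b in y))
    return num / den  # raises ZeroDivisionError if y is identically zero


# Toy vectors standing in for flattened propagator data (not real results).
reference = [1.0 + 0.0j, 0.0 + 2.0j, -1.0 + 1.0j]
candidate = [1.0 + 0.0j, 0.0 + 2.0j, -1.0 + 1.0j]
print(rel_diff(candidate, reference))
```

Two builds that converge to the same solution should give a difference comparable to the requested solver residual; a much larger value points to a genuinely different answer and not just a different iteration count.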

notes

To compile our Chroma, we had to change mg_param.vec_infile[0] = '\0'; to mg_param.vec_infile[0][0] = '\0'; and likewise mg_param.vec_outfile[0] = '\0'; to mg_param.vec_outfile[0][0] = '\0'; in the following 3 files:

lib/actions/ferm/invert/quda_solvers/syssolver_linop_clover_quda_multigrid_w.h
lib/actions/ferm/invert/quda_solvers/syssolver_linop_wilson_quda_multigrid_w.h
lib/actions/ferm/invert/quda_solvers/quda_mg_utils.h