ticket_298_TicketSummary - ACCESS-NRI/accessdev-Trac-archive GitHub Wiki

Assimilating Himawari-8/AHI clear-sky radiances in APS3 global suite

Aim

Single observation test - look at vertical and horizontal spread of a CSR radiance
Effect of AHI CSR radiances on 4DVAR convergence
"OSDP 6 - SatRad processing of geostationary clear-sky radiances" reports: a. "Improvement in analysed winds and humidities between 400 and 800 hPa has been observed." b. "how well it [analysis] is fitted by moisture-sensitive channels on other instruments: this has been found to be improved in more cases than not."

These two findings will be tested in our configuration.

Experimental Set-up

Overview

Suites used

The following suites were used:

Type	Suite	Summary of changes	Comment

Control	u-aj730		A copy of the standard APS3 global suite, u-ag312@24892 (equivalent to UKMO PS38 suite). Note. At the revision when the suite was copied, the trial period was 20160515T06 - 20160731T12. The data were available on marsdev and were copied to the Raijin Bufr archive, but they have since been cleaned out. On marsop AHI data are available from Oct 2017.
Single-obs trial	u-aj977		A copy of u-aj730 with modifications to make the OPS tasks for AHI CSR work; further modifications to assimilate only a single observation.
Longer trial	u-am137		A copy of u-aj730 with modifications to make the OPS tasks for AHI CSR work; a copy of u-aj977 at an earlier revision plus additional modifications.

Note 1. u-ag312 differs from PS38 suite, u-ad365: see rose-suite.conf

Note 2. An error was introduced in u-ag312 and propagated to all child suites, including u-aj730 and u-aj977: the PS37 initial VarBC file was used instead of the PS38 one, so the coefficients for FY-3B and Himawari-8 were missing. The 2 suites were updated to use the PS38 initial VarBC file.

OPS build used

My development OPS branch is r3192_810_ahicsr_bom_bufr

Before making any changes I built this branch and ran a quick test to check that it produces the same Aircraft varobs. The Aircraft varobs files from this build are identical to those from the build '/projects/access/nwpdir/share/APS3/OPS/ops-2016.03.0'.
See here for OPS code changes to make Ops_CreateODB correctly do the Bufr-to-ODB conversion of Bufr files received from MSC

Turning on gl_ops_process_ahiclear app and related apps

An NCI optional app config was created to replace data extraction from MetDB (used by UKMO) with a direct conversion of HIMCSR Bufr files to ODB databases.

Impact of a Single Observation

Selection of clear AHI CSR segments

To obtain complete information about every "FOV", the SatRad NetCDF writefile was turned on. (JMA/MSC uses the term "segment" because each channel observation that OPS processes is an aggregation of 16x16 pixels.)

To reduce the amount of data read in by OPS only a single split HIMCSR Bufr file for 20160515T06 was used.

From the writefile a single clear FOV was selected based on QCflags and its lat/lon noted,

idx=6860
lon[idx]=108.1577
lat[idx]=-28.04669

Filtering was then applied through the "extractcontrolnl{ahiclear}" namelist of the "gl_ops_process_ahiclear" app config, specifying geographic bounds to reduce the number of FOVs input to OPS to about a dozen. The following setting allowed 4 FOVs to be passed to the varobs file:

[namelist:extractcontrolnl{ahiclear}(1)]
...
NorthBound=-28.0
SouthBound=-29.0
EastBound=109.0
WestBound=108.0

I then tightened the bounds in the extractcontrol namelist so that only a single FOV is passed to varobs:

[namelist:extractcontrolnl{ahiclear}(1)]
...
NorthBound=-28.04
SouthBound=-28.05
EastBound=108.16
WestBound=108.15
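
The effect of the two sets of bounds can be checked offline before rerunning OPS. The following is a minimal sketch, not the OPS extraction logic itself: the `Bounds` class and `count_inside` helper are hypothetical names, the bounds are assumed to be inclusive, and no dateline wrap-around is handled. It confirms that the selected FOV (idx=6860) lies inside both the coarse and the refined box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Geographic box mirroring the NorthBound/SouthBound/EastBound/WestBound settings."""
    north: float
    south: float
    east: float
    west: float

    def contains(self, lat: float, lon: float) -> bool:
        # Assumption: bounds are inclusive and the box does not cross the dateline.
        return self.south <= lat <= self.north and self.west <= lon <= self.east


# Values copied from the two namelist settings above.
COARSE = Bounds(north=-28.0, south=-29.0, east=109.0, west=108.0)
FINE = Bounds(north=-28.04, south=-28.05, east=108.16, west=108.15)

# (lat, lon) of the clear FOV selected from the writefile (idx=6860).
SELECTED_FOV = (-28.04669, 108.1577)


def count_inside(bounds, fovs):
    """Count the (lat, lon) pairs that fall inside the box."""
    return sum(bounds.contains(lat, lon) for lat, lon in fovs)


if __name__ == "__main__":
    print("coarse box contains selected FOV:", COARSE.contains(*SELECTED_FOV))
    print("fine box contains selected FOV:", FINE.contains(*SELECTED_FOV))
```

Passing the full lat/lon arrays from the writefile to `count_inside` would give the number of FOVs expected to survive each setting, assuming the inclusive-bounds behaviour holds.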

Next I modified channel selection file to allow only a single channel observation to be written out to varobs????

For comparison I did a single-observation test using IASI (see here)

Longer trial

I modified u-am137 to archive ODB2 and SatRad NetCDF writefiles for ahiclear.
Trial period is from 20160515T06 till 20160619T00 (5 weeks; 1 week of spin-up followed by 4 weeks of clean trial)
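
As a quick check of the trial length, the cycle points can be enumerated. This is a sketch under two assumptions: cycling is 6-hourly (as the cycle times in the diary suggest), and the spin-up ends exactly one week after the first cycle. The `cycles` helper is a hypothetical name.

```python
from datetime import datetime, timedelta

CYCLE_STEP = timedelta(hours=6)  # assumption: 6-hourly cycling
FMT = "%Y%m%dT%H"


def cycles(start: str, end: str):
    """Return all cycle points from start to end inclusive."""
    t = datetime.strptime(start, FMT)
    stop = datetime.strptime(end, FMT)
    points = []
    while t <= stop:
        points.append(t)
        t += CYCLE_STEP
    return points


trial = cycles("20160515T06", "20160619T00")
# Assumption: spin-up is exactly the first 7 days of the trial.
spin_up_end = trial[0] + timedelta(weeks=1)
clean = [t for t in trial if t >= spin_up_end]

if __name__ == "__main__":
    print("trial length:", trial[-1] - trial[0])
    print("total cycles:", len(trial))
    print("clean-trial cycles:", len(clean))
```

Under these assumptions the trial spans 34 days 18 hours, slightly under the nominal 5 weeks.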

Results

Impact of a Single Observation

The vertical levels where the analysis increments are largest do not coincide with the peak heights of the weighting functions. To better understand the relationship between the two, try the following experiments:

3DVAR - this eliminates the effect of PF and its adjoint
non-hybrid - this tests the spreading of observational information by static covariance
Try assimilating channel obs from IASI - weighting functions of IASI channels are sharper so may be easier to interpret

Note 1. ensemble covariance may not be used at higher levels in UKMO hybrid VAR

Longer trial

Results

For plots of some verification results see:

https://accessdev.nci.org.au/~jtl548/verpy/plots/da/sat_da/himawari8/ahiclear/obs_impact/long_trial/global/data.ln/verpy

Based on the limited verification, the impact of H8/AHI CSR looks neutral. This is not surprising, as the number of FOVs passed to VAR is 6000-7000 and, for those FOVs, only a subset of the IR bands is passed.

Analysis fit to observations is,

https://accessdev.nci.org.au/~jtl548/varstats/plots/da/sat_da/himawari8/ahiclear/obs_impact/long_trial/global/data.ln/varstats_ahiclear/plotsdir/cntrlu-aj730_experu-am137_glu_i-1_l0_S0_E-1_var

It appears the impact on analysis is neutral as well [ToDo. revisit the plots later].

Diary

u-aj730

Cycle time	Failed task	Reason for failure	Action taken

20160523T0000Z	glm_var_anal_n216	stdout and stderr
20160524T1200Z	glm_um_recon_em_n108, glm_um_recon_em_n216	My disk quota exceeded and the suite was in a strange state; when I reran the tasks ensemble forecast files from previous cycle were cleaned out and these tasks failed. Looking at the archive it looks as though all the cycles up to 20160524T18 ran and the current cycle is 20160525T00	Reset succeeded all cycles up to 20160524T18 and continued with 20160525T00
20160526T06	engl_ens_addssts_031	stdout has the message "WARNING: Merging SST Perts with ETKF perts has not worked for member 031."; stderr has "/home/548/jtl548/cylc-run/u-aj730/share/fcm_make_var-opt/build-serial/bin/VarScr_UMFileUtils: line 296: 1100: Bus error"	The perturbed SST field seems to be in the analysis perturbation, so I triggered the task again and it succeeded (problem with hardware?)
20160527T0000Z	glu_var_anal_n216	Looking at stderr, some processes were interrupted while writing out the analysis increment file: Var_WriteAnalPFUM.f90 -> Var_WriteModel.f90; while others: Var_WriteAnalPFUM.f90 -> Var_PFexner.f90 -> Var_SwapBounds.f90 -> mpl_waitall_	Reran the task and it succeeded
20160528T0000Z	engl_um_fcst_long_009	It appears there was an MPI-related problem which caused the job to get stuck; PBS then killed the job as it exceeded the walltime request. stderr has the following trace: for one process, um_main.F90 -> um_shell.F90 -> gc_init_thread.F90 -> mpl_init_thread.F90; for another process, um_main.F90 -> um_shell.F90 -> um_config.F90 -> umprintmgr.F90 -> gc_ibcast.F90 -> mpl_bcast.F90	Reran and the task worked
20160528T0000Z	engl_um_fcst_long_ss_009	It looks like the UM history file from the long timestep was deleted. stderr has "Cannot read history file /home/548/jtl548/cylc-run/u-aj730/share/cycle/20160528T0000Z/engl_um_009/engla.xhist"	As engl_um_fcst_long_009 succeeded this task didn't need to run
20160528T1200Z and 20160528T1800Z	On average once per cycle, engl_ens_addssts_* tasks seem to be stuck in the submitted state but the jobs are not in the queue. The 'qstat -f -x' command reports that the jobs failed	unknown	Reset the task to failed state and then trigger
20160606T0000Z	glm_ops_bge_atmos	job.err has the following message: "ERROR: task messaging failure. unsupported operand type(s) for +: 'NoneType' and 'str'"	Reran the task and it completed successfully. Other tasks failed with an identical message in job.err. Was there a problem with PBS or the software that handles messaging?
20160605T1800Z	engl_ens_smcperts	job.err has: "/home/548/jtl548/cylc-run/u-aj730/share/fcm_make_um_utils/utilities/bin/um-fieldcalc: line 338: 22520 Killed $fieldcalc_exec [FAIL] Problem with Fieldcalc program"	Reran the task and it succeeded. Cause of the failure unknown.
20160607T1200Z	engl_ver_hk_ard	stdout has: "Job 8677782.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 255.92MB, Host: [2760]"	Reran the task and it succeeded with the size of ARD_EG bigger than before
20160608T1200Z	I'm experiencing a number of what appear to be random failures with the error message (stderr): "ERROR: task messaging failure. unsupported operand type(s) for +: 'NoneType' and 'str' Received signal ERR ERROR: task messaging failure. unsupported operand type(s) for +: 'NoneType' and 'str'". It turned out that I'm using CYLC_VERSION=7.1.0 and ROSE_VERSION=2017.05.0 (the latest installed versions on Accessdev are 7.4.0 and 2017.05.0). Thinking that this mixing of versions is the cause of the problem, I decided to start the suite using the latest Cylc and Rose versions.		Allowed all tasks of cycletime=20160608T12 to finish. Then warm-started the suite afresh from 20160608T18
20160614T1800Z	engl_ens_smcperts	stderr has: "Job 9630791.r-man2 has exceeded memory allocation on node [38]" and "Received signal TERM"; stdout has: "Job 9705892.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 325.95MB, Host: [2759]"	The jobfs requested in the PBS resource request is 200 MB, which in this case was exceeded. The PBS jobfs resource request was increased to 500 MB (???? also in u-am137????)
20160617T12	glu_ops_process_background_satwind and glm_ops_process_background_satwind	stderr has the following message: "*** glibc detected *** /home/548/jtl548/cylc-run/u-am137/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe: double free or corruption (out): 0x00002ab50c2dd010 ***". This failure is the same as the failure at the same cycle for u-am137. Repeated runs seem to give the same failure, but occasionally other errors occur. Decided to run the glu_ops_process_background_satwind task for 20160617T18 as a test and it succeeded. So it looks like there's a problem with the AMV Bufr files for the cycle 20160617T12. N.B. it looks like the Bufr-to-ODB conversion succeeded and there is an ODB for AMVs	Decided to reset glu_ops_process_background_satwind and glm_ops_process_background_satwind to succeeded and let the suite continue
20160617T12	gl_ver_hk_ard	stdout has the following message: "Job 9880142.r-man2 killed due to exceeding jobfs quota. Quota: 100.0MB, Used: 130.72MB, Host: [47]"	In the family [GL_VER_HK_FDB_AND_ARD], added a PBS resource request for jobfs of 500 MB; also added the same to the family [GL_VER_HK_FDB_AND_ARD_EC]
20160618T1800Z	engl_ens_smcperts	stderr has: "Error from routine: portio2a:flush_unit_buffer" and "*** Fatal error; aborting (SIGABRT) ..."; stderr has:	Reset task to succeeded and let the suite continue
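
Several of the failures above are jobfs quota overruns. A small parser can pull the quota and usage out of the PBS messages to help size a new request. This is a minimal sketch: the regular expression is written against the three messages quoted in this diary only, and `jobfs_overshoot` is a hypothetical helper, not part of PBS, Cylc or Rose.

```python
import re

# Pattern matched against the "Quota: ...MB, Used: ...MB" part of the messages above.
JOBFS_RE = re.compile(r"Quota:\s*([\d.]+)MB,\s*Used:\s*([\d.]+)MB")


def jobfs_overshoot(message: str):
    """Return (quota_mb, used_mb, overshoot_mb), or None if the message does not match."""
    match = JOBFS_RE.search(message)
    if match is None:
        return None
    quota, used = float(match.group(1)), float(match.group(2))
    return quota, used, used - quota


# Messages copied from the diary entries above.
MESSAGES = [
    "Job 8677782.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 255.92MB, Host: [2760]",
    "Job 9705892.r-man2 killed due to exceeding jobfs quota. Quota: 200.0MB, Used: 325.95MB, Host: [2759]",
    "Job 9880142.r-man2 killed due to exceeding jobfs quota. Quota: 100.0MB, Used: 130.72MB, Host: [47]",
]

peak_used = max(jobfs_overshoot(m)[1] for m in MESSAGES)

if __name__ == "__main__":
    for msg in MESSAGES:
        quota, used, over = jobfs_overshoot(msg)
        print(f"quota {quota:.1f} MB, used {used:.2f} MB, over by {over:.2f} MB")
    print(f"peak usage {peak_used:.2f} MB")
```

The largest usage recorded in these messages is below the 500 MB request adopted in the diary.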

u-am137

Cycle time	Failed task	Reason for failure	Action taken

20160516T18	glu_var_anal_n108	stderr log says `PF_bdy_lyr.f90` failed; at `niter= 25` something went wrong; it appears for this cycle mu seems negative more often than other cycles during inner-loop iteration but not sure if this is the cause of the failure	reran task and it succeeded; compared stdout log outputs from previous, failed job and from the successful run and the numbers are exactly same until niter=25
20160605T1200Z	engl_ens_addssts_014	No obvious message in stdout/stderr; perturbed SST field seems to have been merged with analysis perturbation	Reset task to succeeded
20160614T0000Z and 20160614T1200Z	gl_ver_obs_satwind	stderr has the following lines: /home/548/jtl548/cylc-run/u-am137/share/fcm_make_ver/build/bin/VerScr_VerifVsObs: line 443: 24422: Memory fault VerScr_VerifVsObs: VerProg_VerifVsObs.exe failed with rc 267	The default stack size is small on Raijin's compute nodes, so occasionally, when obstore files are slightly bigger than usual, VerProg_VerifVsObs.exe runs out of stack. As a workaround I added 'ulimit -s unlimited' to VerScr_VerifVsObs in the VER source. See https://code.metoffice.gov.uk/trac/ver/ticket/25
20160612T1200Z - 20160614T1200Z	All OPS tasks to do with ahiclear failed to run. Even more strange, they do not appear on gcylc (!) May need to start from 20160612T06 to generate warm-running files for the analysis at 20160612T12. N.B. it looks like during testing of FASTRUN I inadvertently overwrote atmanl files from 20160612T00 and 20160612T06		FASTRUN using an atmanl file is more involved than I thought. ~~Here's more details about how to modify glu_um_fcst task to do FASTRUN using atmanl.~~ As FASTRUN using an atmanl file requires changes to the time profile of output STASH, I decided not to rerun from an earlier cycle. Instead I started from the cycle when enough warm-running files are available: 20160614T12 seems to have enough warm-running background files for 20160614T18, so I warm-started the suite from 20160614T18. So AHI CSR is not used for cycles 20160612T1200Z - 20160614T1200Z
20160617T12	gl[mu]_ops_process_background_satwind	stderr has the following message: "*** glibc detected *** /home/548/jtl548/cylc-run/u-am137/share/fcm_make_ops/build/bin/OpsProg_CreateODB.exe: double free or corruption (out): 0x00002ab50c2dd010 ***"	It looks like there's a problem with the AMV Bufr files (see diary for u-aj730). Reset gl[mu]_ops_process_background_satwind tasks to succeeded and let the suite continue
20160617T12	gl_ver_obs_satwind	stderr has: Zero observations found in ODB: "/home/548/jtl548/cylc-run/u-am137/share/data/ver/user/ODB_GM/ODB_20160617T1200Z_Satwind.obstore"	Decided to reset the task to succeeded and let the suite proceed
20160618T18	glu_ops_odb_to_odb2_satwind	See same failure in u-aj730	See above

Useful information

In the suite the OPS tasks for H-8 AHI CSR use the label, 'ahiclear'
In the OPS source code the obsgroup used for H-8 AHI CSR is 'ObsGroupAHIClr' - see 'OpsMod_ObsInfo/OpsMod_ObsGroupInfo.f90'
In the OPS source code the MetDB subtype used for AHI CSR is 'HIMCSR' - see '../public/Ops_Constants/Ops_SubTypeNameToNum.inc'

ToDo

In raijin4:/g/data/dp9/da/access-g/ops/bufr add "ahiclear.*.bufr" to ECMA tarballs
In "gl[um]_ops_process_background_ahiclear" tasks may need to modify MetDB elements file
- the repetition may not match what's in ODB