Conversation

guoqing-noaa
Contributor

guoqing-noaa commented Sep 18, 2025

Previously, MPASJEDI could not write the analysis into init.nc at cold-start cycles due to a global attribute mismatch.

Thanks to @SamuelTrahanNOAA for his great work updating the MPAS-Model so that the MPAS I/O interface can ignore global attribute mismatches and allow writing the DA analysis into init.nc directly. Check with @SamuelTrahanNOAA for more details; Sam's recent model changes can be found at RRFSx/MPAS-Model#5.

This PR incorporates Sam Trahan's model changes into rrfs-workflow and retires the previous temporary ncks method.

@guoqing-noaa
Contributor Author

Converting to a draft while running retro tests.

@guoqing-noaa
Contributor Author

guoqing-noaa commented Oct 2, 2025

We just updated rrfs-workflow to work on the newly OS-upgraded GaeaC6.
I ran a quick test and found that the MPAS-Model changes crash the mpassit step, because the MPAS-Model skips writing some required global attributes to the diag.nc and history.nc files.

Here is the error message:

 - INIT 3D NZ+1 FIELD            2
 w
 - READ INPUT HIST DATA.

 FATAL ERROR: reading config_start_time: NetCDF: Attribute not found
 STOP.
MPICH ERROR [Rank 7] [job id 211291855.0] [Wed Oct  1 20:11:58 2025] [c6n1794] - Abort(999) (rank 7 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 7

I ran ncdump on the diag.nc and history.nc files and confirmed that the global attributes are indeed missing.
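
For anyone checking their own output, a minimal way to confirm the attribute is absent (using the sample history file listed later in this thread; any diag or history file from the failed cycle should behave the same):

ncdump -h history.2024-05-06_00.00.00.nc | grep -i config_start_time

On files written by the unmodified model this should print the config_start_time global attribute; on the affected files it prints nothing.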

@SamuelTrahanNOAA FYI, skipping not-found attributes in MPAS-IO has unexpected side effects here. We need to examine this issue further. Thanks!

@SamuelTrahanNOAA
Contributor

Do you have sample files to look at?

I'm baffled by the absence of that attribute. It should be in all the mpas history and diag files.

@guoqing-noaa
Contributor Author

guoqing-noaa commented Oct 2, 2025

> Do you have sample files to look at?
>
> I'm baffled by the absence of that attribute. It should be in all the mpas history and diag files.

Sorry, I had tagged a different Sam.

Here are sample files:

/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/sp193/stmp/20240506/rrfs_mpassit_00_v2.1.1/det/save/diag.2024-05-06_00.00.00.nc
/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/sp193/stmp/20240506/rrfs_mpassit_00_v2.1.1/det/save/history.2024-05-06_00.00.00.nc

I think you should be able to reproduce this by running an MPAS forecast with your updated MPAS-Model.

@SamuelTrahanNOAA
Contributor

I suspect I know the cause and solution, but I must investigate further to be certain.

Recall that I added code to only write an attribute if it was already defined. That's because JEDI was writing into MPAS files that lacked some of the attributes JEDI wanted to write. MPAS uses the archaic NetCDF 3 format, which does not support defining attributes in data mode. My fix was therefore to skip writing an attribute unless it was already defined.

MPAS may be using the same section of code to define attributes while NetCDF is in define mode. In that case, my "fix" will prevent MPAS from defining attributes at all. To fix my fix, I need to allow defining attributes whenever NetCDF is in define mode.

@SamuelTrahanNOAA
Contributor

The branch isn't compiling on GAEA. I have a potential fix, but I'm getting an error about a missing target. Is your branch pointing to the correct version of the MPAS-Model?

+ make -j8 ifort_icx CORE=atmosphere PRECISION=single
NOTE: PRECISION=single is unnecessary, single is the default
make: *** No rule to make target 'ifort_icx'.  Stop.

My clone is here:

/gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/coldstartDA

I compiled with this command in the sorc directory:

nohup ./build.all > build.log 2>&1 &

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA I am sorry: when I updated this PR, I forgot to include the ifort_icx change. I will get it updated soon.

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA Could you try the latest coldstartDA branch? Thanks!

@SamuelTrahanNOAA
Contributor

I am attempting this now. The build hasn't finished yet, but I'll update you when I have something useful to report.

@MatthewPyle-NOAA changed the title from "Allow MPASJEDI write analysis into init.nc directly at cold-start cycles" to "[rrfs-mpas-jedi] Allow MPASJEDI write analysis into init.nc directly at cold-start cycles" on Oct 3, 2025
@SamuelTrahanNOAA
Contributor

This branch, even without my changes, segfaults.

There seems to be something wrong with "scalar 12." It starts at 0.468376E+09 and ends at NaN.

Log lines with `global min, max scalar 12`
  global min, max scalar 12 0.00000 0.468376E+09
  global min, max scalar 12 0.00000 0.931084E+09
  global min, max scalar 12 0.00000 0.139506E+10
  global min, max scalar 12 0.00000 0.185669E+10
  global min, max scalar 12 0.00000 0.231254E+10
  global min, max scalar 12 0.00000 0.276079E+10
  global min, max scalar 12 0.00000 0.301664E+10
  global min, max scalar 12 0.00000 0.342670E+10
  global min, max scalar 12 0.00000 0.383602E+10
  global min, max scalar 12 0.00000 0.424475E+10
  global min, max scalar 12 0.00000 0.465480E+10
  global min, max scalar 12 0.00000 0.250269E+10
  global min, max scalar 12 0.00000 0.251821E+10
  global min, max scalar 12 0.00000 0.252903E+10
  global min, max scalar 12 0.00000 0.253809E+10
  global min, max scalar 12 0.111000E+08 0.131899E+08
  global min, max scalar 12 0.111000E+08 0.132465E+08
  global min, max scalar 12 0.111000E+08 0.133032E+08
  global min, max scalar 12 0.111000E+08 0.133574E+08
  global min, max scalar 12 0.111000E+08 0.134094E+08
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN
  global min, max scalar 12 NaN NaN

Here's a log from the failing job:

/gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/com/rrfs/v2.1.2/logs/rrfs.20240527/00/enkf/rrfs_fcst_m001_e12km_2024052700.log
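
For anyone reproducing this, the scalar 12 lines above can be pulled straight from that log with a plain grep:

grep 'global min, max scalar 12' /gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/com/rrfs/v2.1.2/logs/rrfs.20240527/00/enkf/rrfs_fcst_m001_e12km_2024052700.log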

This is where I built it:

/gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/unmodified-coldstartDA

@SamuelTrahanNOAA
Contributor

SamuelTrahanNOAA commented Oct 3, 2025

Variable nwfa in init.nc contains a lot of NaNs and big numbers.

/gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/stmp/20240527/rrfs_prep_ic_00_v2.1.2/enkf/mem001/init.nc

EDIT: The job reads another file, which is a symlink to that one: /gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/stmp/20240527/rrfs_fcst_00_v2.1.2/enkf/mem001/fcst_00/init.nc

Some NaNs
( module use /gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/unmodified-coldstartDA/modulefiles/rrfs/ ; module load gaea.intel ;  ncdump  /gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/stmp/20240527/rrfs_fcst_00_v2.1.2/enkf/mem001/fcst_00/init.nc ) | grep -iE '=|nan'
... lots of stuff with no NaNs ...
 nwfa =
    -1.222968e+38, 0, NaNf, 0, 1.699064e-38, 0, 6.796257e-38, 0, 
    1.229614e+38, 0, NaNf, 0, -1.699064e-38, 0, -6.796257e-38, 0, 
    -7.685086e+36, 0, -3.074034e+37, 0, -1.229614e+38, 0, NaNf, 0, 
    1.23626e+38, 0, NaNf, 0, -1.708248e-38, 0, -6.832991e-38, 0, 
    -3.09065e+37, 0, -1.23626e+38, 0, NaNf, 0, 1.717431e-38, 0, 6.869726e-38, 
    1.942041e+36, 0, 7.768163e+36, 0, 3.107265e+37, 0, 1.242906e+38, 0, NaNf, 
    -3.107265e+37, 0, -1.242906e+38, 0, NaNf, 0, 1.726615e-38, 0, 
    3.123881e+37, 0, 1.249552e+38, 0, NaNf, 0, -1.726615e-38, 0, 
    NaNf, 0, 1.735798e-38, 0, 6.943194e-38, 0, 2.777278e-37, 0, 1.110911e-36, 
    3.140496e+37, 0, 1.256198e+38, 0, NaNf, 0, -1.735798e-38, 0, 
    -7.85124e+36, 0, -3.140496e+37, 0, -1.256198e+38, 0, NaNf, 0, 
    1.262844e+38, 0, NaNf, 0, -1.744982e-38, 0, -6.979928e-38, 0, 
    -7.892778e+36, 0, -3.157111e+37, 0, -1.262844e+38, 0, NaNf, 0, 
    7.934316e+36, 0, 3.173727e+37, 0, 1.269491e+38, 0, NaNf, 0, 
    -1.269491e+38, 0, NaNf, 0, 1.763349e-38, 0, 7.053397e-38, 0, 
    7.975855e+36, 0, 3.190342e+37, 0, 1.276137e+38, 0, NaNf, 0, 
    NaNf, 0, 1.772533e-38, 0, 7.090131e-38, 0, 2.836052e-37, 0, 1.134421e-36, 
    3.206957e+37, 0, 1.282783e+38, 0, NaNf, 0, -1.772533e-38, 0, 
... lots more stuff with NaNs ...
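
A rough way to count the bad entries (a sketch assuming the same ncdump text output shown above):

ncdump -v nwfa /gpfs/f6/bil-fire10-oar/scratch/Samuel.Trahan/hrly_12km/stmp/20240527/rrfs_fcst_00_v2.1.2/enkf/mem001/fcst_00/init.nc | grep -o 'NaNf' | wc -l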

@SamuelTrahanNOAA
Contributor

@guoqing-noaa Do you know why the init.nc contains gibberish in nwfa? It looks almost like uninitialized memory was written to the file.

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA

> This branch, even without my changes, segfaults.

In the current rrfs-workflow, which many of us have been using without problems, the MPAS-Model hash is fb92d7be.
In this branch, the MPAS-Model hash is 7627288b.
If we git diff fb92d7be 7627288b (see the command sketch after the file list), the differences are confined to the following 3 files:

src/framework/mpas_io.F
src/framework/mpas_io_streams.F
src/framework/mpas_stream_manager.F
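
For reference, a sketch of the command that produces that list, assuming both hashes are present in your local MPAS-Model clone:

git diff --name-only fb92d7be 7627288b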

I think you may be running 3DVAR with GSIBEC, which does not work with spack-stack-1.9.x.
Could you try running 3DEnVar instead, using this exp file as a reference?

/gpfs/f6/arfs-gsl/world-shared/gge/rrfs2/PR2rrfs-workflow/rrfs-workflow.20251001/workflow/exp.conus12km_gaeac6.arfs

Thanks!

@guoqing-noaa
Contributor Author

> @guoqing-noaa Do you know why the init.nc contains gibberish in nwfa? It looks almost like uninitialized memory was written to the file.

I think this is expected, as the GFS grib2 files do not contain nwfa.

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA I found that you have been running getkf.
At this stage, I think we can focus on rrfsdet instead of rrfsenkf.
Thanks!

@SamuelTrahanNOAA
Contributor

SamuelTrahanNOAA commented Oct 6, 2025

I reproduced the failure, but at first I could not run the full system: it wants files I don't have. Once I point to your copy of those files, the workflow runs to the failure point:

Missing: /gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/OPSROOT/bec.gefs.2024050600-2024051223/rrfs.20240505/23/ic/enkf/mem030/init.nc

Solution: export HYB_ENS_PATH="/gpfs/f6/bil-fire10-oar/world-shared/gge/OPSROOT/bec.gefs.2024050600-2024051223"

I'll rerun with the fix to my fix and see if it works.

@guoqing-noaa
Contributor Author

guoqing-noaa commented Oct 6, 2025

@SamuelTrahanNOAA Yes, please use:
export HYB_ENS_PATH="/gpfs/f6/bil-fire10-oar/world-shared/gge/OPSROOT/bec.gefs.2024050600-2024051223"

This is a staged copy of 30 ensemble members that anyone can use, so we don't each need to run the ensemble system ourselves.

@SamuelTrahanNOAA
Contributor

I have a fix that works for the mpas forecast. I'm retesting it on the full workflow with jedivar now.

@SamuelTrahanNOAA
Contributor

Fix is here:

I'm still running the full workflow test.

@SamuelTrahanNOAA
Contributor

I closed my MPAS-Model PR since it went to the wrong branch. I'm not sure which branch you're using for MPAS-Model. If you can tell me, I'll open a PR against it. Otherwise, you'll find the fix here:

@SamuelTrahanNOAA
Contributor

The workflow has proceeded further, with global attributes happily present, but now it is failing due to bad data in volg. That variable has NaNs and other nonsense values, but I see no problems in other variables. It's the first output time, so the bad data is probably already in the initial state.

Any thoughts?

/gpfs/f6/arfs-gsl/world-shared/Samuel.Trahan/rrfs2/PR2rrfs-workflow/sp193good/stmp/20240506/rrfs_mpassit_01_v2.1.2/det/mpassit_g01_01/history.2024-05-06_01.00.00.nc

Some NaNs

They're scattered about the file with other meaningless numbers like -5.283493e+35

 volg =
...
    -0.4502431, NaNf, 9.932466e+20, 1.059393e-28, -4.581919e-12, 
...
    5.228841e-17, 8.112756e-17, NaNf, 1.461707e-09, -1.776313e+12, 
...
    -3.226139e-24, NaNf, -2.84817e+35, -4.666101e+11, -3.071974e+22, 
...
    -9.491715e-35, 1.314693e-27, 1.714347e+34, NaNf,
...
    2.925704e-28, -1.397752e-31, -1134056, -1.877792e-29, 1755.415, NaNf, 
...
    NaNf, 3.100198e-09, 4.924234e-21,
...
    6.949755e-13, NaNf, -14.72541, 7.675564e-29, 8.862592e+24, 4.103332e-10, 
...

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA volg should have meaningful values after the cold start forecast. You may compare it with a baseline experiment using the latest authoritative rrfs-workflow.

@SamuelTrahanNOAA
Contributor

I checked init.nc, and volg is never initialized. Also, it is invalid in the time 0 output. I suspect nothing in the model ever writes to the volg variable, and it's outputting whatever happened to be in memory. If you used ncks to add volg to init.nc, it would retain that initial value. That's not a fix, it's a kludge. Either the model should write valid data to volg, or it shouldn't write volg at all. I might be wrong, though, and I won't know until I try a version of this workflow that actually works.
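
For reference, the kind of ncks append being described would look roughly like this (the donor filename is hypothetical; it would be any file that already contains a sane volg field):

ncks -A -v volg donor_history.nc init.nc

Here -A appends into the existing init.nc and -v volg restricts the copy to that single variable.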

Can you tell me where I can find the best known working version of this workflow?

@SamuelTrahanNOAA
Contributor

The MPAS-Model log file seems upset about the lack of volg. Apparently it was expected in the input file.

WARNING: Variable cellsOnCellsOnCell not in input file.
 Reading initial state from 'input' stream
WARNING: Variable volg not in input file.

@SamuelTrahanNOAA
Contributor

I think the error is in your namelist.init_atmosphere where you disable initialization of the tempo scheme, despite using the tempo scheme in namelist.atmosphere:

&preproc_stages
    config_static_interp = false
    config_native_gwd_static = false
    config_native_gwd_gsl_static = false
    config_vertical_grid = true
    config_met_interp = true
    config_input_sst = false
    config_frac_seaice = true
    config_tempo_rap = false
/

Note the config_tempo_rap = false

@SamuelTrahanNOAA
Contributor

Now that I've looked deeper, it appears the init_atmosphere doesn't even know how to initialize volg. It's possible to provide boundary conditions for it, but not initialize it. Inside the model, I don't think volg is initialized either. It remains invalid until after the first time the physics updates it. Developers probably thought this was okay since volg is only a diagnostic quantity. It appears MPAS can't handle uninitialized diagnostic quantities that are expected to be in the initial state.

@guoqing-noaa
Contributor Author

guoqing-noaa commented Oct 8, 2025

> Now that I've looked deeper, it appears the init_atmosphere doesn't even know how to initialize volg. It's possible to provide boundary conditions for it, but not initialize it. Inside the model, I don't think volg is initialized either. It remains invalid until after the first time the physics updates it. Developers probably thought this was okay since volg is only a diagnostic quantity. It appears MPAS can't handle uninitialized diagnostic quantities that are expected to be in the initial state.

@SamuelTrahanNOAA Thanks for digging into this. But this behavior is NOT limited to this PR. I think every GSL MPAS realtime and retro run has the same issue. The latest rrfs-workflow.v2 does not initialize volg either, but we don't get any crashes.

@haiqinli @barlage Do you have any input on this? Thanks!

@SamuelTrahanNOAA
Contributor

The only reason why it would cause a crash is if MPASSIT tries to perform floating-point operations on the volg data within the MPAS output file.

@SamuelTrahanNOAA
Contributor

This time it did fail. It appears volg is still listed in histlist_3d. The obvious fix is to remove it (see the one-liner after the list below).

zgrid                           PHB
w                               W
theta                           T
uReconstructZonal               U
uReconstructMeridional          V
qv                              QVAPOR
qc                              QCLOUD
qr                              QRAIN
qi                              QICE
qs                              QSNOW
qg                              QGRAUP
nc                              QNCLOUD
ni                              QNICE
nr                              QNRAIN
ng                              QNGRAUP
volg                            VOLG
nwfa                            QNWFA
nifa                            QNIFA
pressure                        P_HYD
rho                             MUB
cldfrac_bl                      CLDFRA_BL
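
A one-line way to drop that entry, assuming histlist_3d is a plain text file in the workflow's parm area (the path here is hypothetical):

sed -i '/^volg[[:space:]]/d' parm/histlist_3d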

@SamuelTrahanNOAA
Contributor

In a meeting, we decided to modify MPAS-Model's Registry.xml to initialize volg. Unfortunately, the RDASApp fails to build if Registry.xml is changed in its copy of mpas. (There are three copies of mpas in the rrfs-workflow, and I updated all three.)

There are several errors about targets that already exist, like this one:

CMake Error at mpas-jedi/src/tools/input_gen/CMakeLists.txt:5 (add_executable):
  add_executable cannot create target "mpas_streams_gen" because another
  target with the same name already exists.  The existing target is an
  executable created in source directory
  "/gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/bugfix/coldstartDA/sorc/RDASApp/bundle/MPAS/src/tools/input_gen".
  See documentation for policy CMP0002 for more details.

Also, there are some errors about libraries that don't exist.

CMake Error at mpas-jedi/src/tools/registry/CMakeLists.txt:13 (target_link_libraries):
  Attempt to add link library "parselib" to target "mpas_parse_atmosphere"
  which is not built in this directory.

  This is allowed only when policy CMP0079 is set to NEW.

All but one of the errors tell me to read about policy CMP0002. Here is the one that doesn't:

CMake Error at /autofs/ncrc-svm1_proj/epic/spack-stack/c6/spack-stack-1.9.3/envs/ue-oneapi-2024.2.1/install/oneapi/2024.2.1/cmake-3.27.9-vhhq75e/share/cmake-3.27/Modules/FetchContent.cmake:1717 (message):
  Content mpas_data already populated in
  /gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/bugfix/coldstartDA/sorc/RDASApp/build/_deps/mpas_data-src
Call Stack (most recent call first):
  mpas-jedi/src/core_atmosphere/CMakeLists.txt:427 (FetchContent_Populate)
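
One generic thing worth trying for the "already populated" error is to clear the stale FetchContent state and reconfigure from a clean build tree; this is only a guess at a workaround, not a confirmed fix, and the path is copied from the error message above:

rm -rf /gpfs/f6/bil-fire10-oar/world-shared/Samuel.Trahan/bugfix/coldstartDA/sorc/RDASApp/build
# then re-run the RDASApp build (e.g. via build.all or build.rdas, per the workflow's usual procedure)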

@SamuelTrahanNOAA
Contributor

I tried reverting mpas and mpas-jedi, and explicitly setting cmake_policy(SET CMP0002 NEW), but it didn't fix anything.

@guoqing-noaa
Contributor Author

@SamuelTrahanNOAA There is only one copy of MPAS-Model in rrfs-workflow, i.e. sorc/MPAS-Model.
build.rdas will clone that version and replace sorc/RDASApp/sorc/mpas.

The issue you hit may be due to some of your local changes. I can update my branch to include the Registry.xml change that initializes volg, for your test. Thanks!
