Skip to content

Conversation

weihuang-jedi
Copy link
Contributor

Description

Improve GW performance by adjust URSA, and GAEAC6 tasks per node, and px-py layout to gain short wall-clock time, or using less resource (nodes) with similar elapse time, or both.

Resolves #4140

Type of change

  • Bug fix (fixes something broken)
  • New feature (adds functionality)
  • Maintenance (code refactor, clean-up, new CI test, etc.)
  • [ x] Performance improvement

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO
  • Does this change require an update to any of the following submodules? NO (If YES, please add a link to any PRs that are pending.)
    • EMC verif-global
    • GDAS
    • GFS-utils
    • GSI
    • GSI-monitor
    • GSI-utils
    • UFS-utils
    • UFS-weather-model
    • wxflow

How has this been tested?

  • Forecast-only on Ursa, GaeaC6

Checklist

  • [x ] Any dependent changes have been merged and published
  • [x ] My code follows the style guidelines of this project
  • [ x] I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have documented my code, including function, input, and output descriptions
  • [x ] My changes generate no new warnings
  • New and existing tests pass with my changes
  • This change is covered by an existing CI test or a new one has been added
  • Any new scripts have been added to the .github/CODEOWNERS file with owners
  • I have made corresponding changes to the system documentation if necessary

weihuang-jedi and others added 9 commits September 27, 2025 13:46
…into feature/adjust_tasks_per_node_layout

* 'develop' of github.com:NOAA-EPIC/global-workflow-cloud:
  Adding workflow infrastructure for wave_stat and wave_stat_pnt jobs (NOAA-EMC#3846)
  Add ush and parm of SPOC (NOAA-EMC#3944)
  Bugfixes from IMS C++ code additions (NOAA-EMC#4105)
  Increase C96 gcdas fcst wall time to 1.5 hours (NOAA-EMC#4108)
  Use default stochastic physics (NOAA-EMC#4073)
  Fix new feature template (NOAA-EMC#4104)
…into feature/adjust_tasks_per_node_layout

* 'develop' of github.com:NOAA-EPIC/global-workflow-cloud:
  Update CODEOWNERS for new oceanice_products assignments (NOAA-EMC#4134)
  Improve GW forecast timing of high-resolution runs with findings from containerized GW (NOAA-EMC#4123)
  Update CODEOWNERS to change job assignments (NOAA-EMC#4128)
  Update j-jobs to be bash linter compliant (NOAA-EMC#4124)
  Update the marine analysis archiving (NOAA-EMC#4082)
  Port tracker/genesis tasks to cloud; load wgrib2; add noaacloud to load_gw_gsi_modules (NOAA-EMC#4111)
  Only run wave bnd pnt for gfs (NOAA-EMC#4112)
…into feature/adjust_tasks_per_node_layout

* 'develop' of github.com:NOAA-EPIC/global-workflow-cloud:
  Correct parametric and ensemble background error statistics filenames in marine DA (NOAA-EMC#4120)
…into feature/adjust_tasks_per_node_layout

* 'develop' of github.com:NOAA-EPIC/global-workflow-cloud:
  Ctest case updates (NOAA-EMC#4118)
  Consolidate load_*_modules scripts into a generic load_modules.sh script (NOAA-EMC#4126)
  Updates for Gaea C6 following OS upgrade (NOAA-EMC#4110)
@weihuang-jedi weihuang-jedi force-pushed the feature/adjust_tasks_per_node_layout branch from c767c57 to a8fb88c Compare October 9, 2025 15:53
@weihuang-jedi weihuang-jedi marked this pull request as ready for review October 9, 2025 16:05
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I note that these changes are not exercised by running the global-workflow automated tests, and the resource changes should be reviewed by those routinely running at these resolutions.

case "${CASE}" in
"C768")
export tasks_per_node=144
export tasks_per_node=180
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GeorgeVandenberghe-NOAA had noted that the UFS did not run properly on C6 when allowed to use more than 144 tasks at high resolution. Is this issue fixed?

elif [[ "${RUN}" = "gfs" || "${RUN}" = "gcafs" ]]; then
export layout_x=12
export layout_y=16
export layout_y=15
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this in combination with the config.resources.<host> changes so that the tiles land on the same nodes? Might want inputs from @dpsarmie on this change as this will affect all hosts. This would also need to be tested on WCOSS2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that some justification would be needed to have this change (on top of thorough testing since this seems to have only been tested on Ursa and Gaea).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

768/15 = 51.2 so I also think this means we can't be guaranteed reproducibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavidHuber-NOAA Yes, that is the thought, so we hope that more communications can be within node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JessicaMeixner-NOAA From my understanding 768/15 = 51.2, which is not an integer, which should impact load balance, but not reproducibility. But with C768 is a pretty high-resolution, there are should be plenty of tasks to do, load-balance is probably not as important as at low-res. I guess reduce cross-node communication is more important.
From my understanding task per node impacts memory per task can use, and layout will mostly impact load balance (and probably cross cube-spheric tiles communication).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dpsarmie can you weigh in here. It’s my understanding the layout chosen here will not reproduce with other layouts since it’s not divisible. Maybe things have changed?

;;
"C1152")
export tasks_per_node=144
export tasks_per_node=160
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can someone share test logs for this change? I know after the OS update we still have to make manual changes to the number of write grid tasks for gaea specifically, so we'd need to confirm this is working before updating as this will impact GFS retro tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can not run C1152 on Ursa, which exceed the maximum numbers of nodes I can require.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing the layout on gaea. We shouldn’t change the layout without testing and showing that this will work.

@GeorgeVandenberghe-NOAA
Copy link

GeorgeVandenberghe-NOAA commented Oct 9, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve GW performance efficiency with different tasks per node and layout

6 participants