Improve GW performance efficiency with different tasks per node and layout #4141
base: develop
Conversation
Merged 'develop' of github.com:NOAA-EPIC/global-workflow-cloud into feature/adjust_tasks_per_node_layout, bringing in:
- Adding workflow infrastructure for wave_stat and wave_stat_pnt jobs (NOAA-EMC#3846)
- Add ush and parm of SPOC (NOAA-EMC#3944)
- Bugfixes from IMS C++ code additions (NOAA-EMC#4105)
- Increase C96 gcdas fcst wall time to 1.5 hours (NOAA-EMC#4108)
- Use default stochastic physics (NOAA-EMC#4073)
- Fix new feature template (NOAA-EMC#4104)
- Update CODEOWNERS for new oceanice_products assignments (NOAA-EMC#4134)
- Improve GW forecast timing of high-resolution runs with findings from containerized GW (NOAA-EMC#4123)
- Update CODEOWNERS to change job assignments (NOAA-EMC#4128)
- Update j-jobs to be bash linter compliant (NOAA-EMC#4124)
- Update the marine analysis archiving (NOAA-EMC#4082)
- Port tracker/genesis tasks to cloud; load wgrib2; add noaacloud to load_gw_gsi_modules (NOAA-EMC#4111)
- Only run wave bnd pnt for gfs (NOAA-EMC#4112)
- Correct parametric and ensemble background error statistics filenames in marine DA (NOAA-EMC#4120)
- Ctest case updates (NOAA-EMC#4118)
- Consolidate load_*_modules scripts into a generic load_modules.sh script (NOAA-EMC#4126)
- Updates for Gaea C6 following OS upgrade (NOAA-EMC#4110)
Force-pushed from c767c57 to a8fb88c
LGTM, but I note that these changes are not exercised by running the global-workflow automated tests, and the resource changes should be reviewed by those routinely running at these resolutions.
 case "${CASE}" in
     "C768")
-        export tasks_per_node=144
+        export tasks_per_node=180
@GeorgeVandenberghe-NOAA had noted that the UFS did not run properly on C6 when allowed to use more than 144 tasks at high resolution. Is this issue fixed?
 elif [[ "${RUN}" = "gfs" || "${RUN}" = "gcafs" ]]; then
     export layout_x=12
-    export layout_y=16
+    export layout_y=15
Is this in combination with the config.resources.<host> changes, so that the tiles land on the same nodes? We might want input from @dpsarmie on this change since it will affect all hosts. This would also need to be tested on WCOSS2.
I agree that some justification is needed for this change (on top of thorough testing, since it seems to have only been tested on Ursa and Gaea).
768/15 = 51.2 so I also think this means we can't be guaranteed reproducibility.
@DavidHuber-NOAA Yes, that is the idea; we hope more of the communication can stay within a node.
@JessicaMeixner-NOAA From my understanding, 768/15 = 51.2, which is not an integer; that should affect load balance, but not reproducibility. And since C768 is a fairly high resolution, there should be plenty of work per task, so load balance is probably not as important as at low resolution. I suspect reducing cross-node communication matters more.
From my understanding, tasks per node affects how much memory each task can use, while the layout mostly affects load balance (and probably communication across cube-sphere tiles).
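The divisibility point can be sketched numerically (a hypothetical illustration, not part of this PR; the even-split function below is an assumption about how rows would be distributed, not the actual FV3 decomposition code):

```python
# Sketch: how C768 grid rows split across layout_y ranks.
# With layout_y=15, 768/15 = 51.2, so ranks get unequal row counts
# (load imbalance); with layout_y=16, every rank gets exactly 48 rows.

def rows_per_rank(npts, nranks):
    """Distribute npts rows over nranks as evenly as possible."""
    base, extra = divmod(npts, nranks)
    return [base + 1 if r < extra else base for r in range(nranks)]

for ly in (15, 16):
    rows = rows_per_rank(768, ly)
    print(f"layout_y={ly}: min={min(rows)} max={max(rows)} "
          f"even={len(set(rows)) == 1}")
```

Under this sketch, layout_y=15 gives some ranks 52 rows and others 51, while layout_y=16 is perfectly even; whether the uneven case also affects bit-for-bit reproducibility is the question raised above.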
@dpsarmie, can you weigh in here? It's my understanding that the layout chosen here will not reproduce with other layouts since it's not evenly divisible. Maybe things have changed?
     ;;
     "C1152")
-        export tasks_per_node=144
+        export tasks_per_node=160
Can someone share test logs for this change? I know after the OS update we still have to make manual changes to the number of write grid tasks for gaea specifically, so we'd need to confirm this is working before updating as this will impact GFS retro tests.
I cannot run C1152 on Ursa; it exceeds the maximum number of nodes I can request.
This is changing the layout on gaea. We shouldn’t change the layout without testing and showing that this will work.
No, it persists. The first issue is UPP memory use, but there are other memory-exhausting components behind UPP, and this is going to require some work.
Description
Improve GW performance by adjusting the URSA and GAEAC6 tasks per node and the px-py layout to achieve shorter wall-clock time, use fewer resources (nodes) at a similar elapsed time, or both.
Resolves #4140
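As a rough illustration of the node-count arithmetic behind these settings (a hypothetical sketch: the layouts are taken from the diff above, but the six-tile task total ignores the write component and any other tasks, so the node counts are not the actual job sizes):

```python
import math

# Sketch: nodes needed for the forecast grid component alone, given a
# per-tile layout (layout_x * layout_y, over 6 cube-sphere tiles) and a
# tasks-per-node setting. Write-component tasks are ignored for simplicity.
def nodes_needed(layout_x, layout_y, tasks_per_node):
    total_tasks = layout_x * layout_y * 6
    return math.ceil(total_tasks / tasks_per_node)

# C768 example from the diff: 12x16 at 144 tasks/node vs 12x15 at 180.
print("old:", nodes_needed(12, 16, 144))
print("new:", nodes_needed(12, 15, 180))
```

Under these simplifying assumptions, the new settings pack the same component into fewer nodes, which is the resource-saving direction the description claims.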
Type of change
Change characteristics
How has this been tested?
Checklist