-
Notifications
You must be signed in to change notification settings - Fork 202
Improve GW performance efficiency with different tasks per node and layout #4141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
f613832
d1bf01e
787df18
2f70d7c
958ae08
2a1f9de
95ad6f2
589c7d0
a8fb88c
ef967a3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,10 +10,10 @@ case ${step} in | |
"fcst" | "efcs") | ||
case "${CASE}" in | ||
"C768") | ||
export tasks_per_node=144 | ||
export tasks_per_node=180 | ||
;; | ||
"C1152") | ||
export tasks_per_node=144 | ||
export tasks_per_node=160 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can someone share test logs for this change? I know after the OS update we still have to make manual changes to the number of write grid tasks for gaea specifically, so we'd need to confirm this is working before updating as this will impact GFS retro tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can not run C1152 on Ursa, which exceed the maximum numbers of nodes I can require. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is changing the layout on gaea. We shouldn’t change the layout without testing and showing that this will work. |
||
;; | ||
*) | ||
# Nothing to do for other resolutions | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -271,7 +271,7 @@ case "${fv3_res}" in | |
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE=15 #Note this should be 10 for WCOSS2 | ||
elif [[ "${RUN}" = "gfs" || "${RUN}" = "gcafs" ]]; then | ||
export layout_x=12 | ||
export layout_y=16 | ||
export layout_y=15 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this in combination with the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that some justification would be needed to have this change (on top of thorough testing since this seems to have only been tested on Ursa and Gaea). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 768/15 = 51.2 so I also think this means we can't be guaranteed reproducibility. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @DavidHuber-NOAA Yes, that is the thought, so we hope that more communications can be within node. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @JessicaMeixner-NOAA From my understanding 768/15 = 51.2, which is not an integer, which should impact load balance, but not reproducibility. But with C768 is a pretty high-resolution, there are should be plenty of tasks to do, load-balance is probably not as important as at low-res. I guess reduce cross-node communication is more important. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @dpsarmie can you weigh in here. It’s my understanding the layout chosen here will not reproduce with other layouts since it’s not divisible. Maybe things have changed? |
||
export WRITE_GROUP=4 | ||
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE=20 #Note this should be 10 for WCOSS2 | ||
fi | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GeorgeVandenberghe-NOAA had noted that the UFS did not run properly on C6 when allowed to use more than 144 tasks at high resolution. Is this issue fixed?