
Conversation

pgmpablo157321 (Contributor)

Fix #419


github-actions bot commented Aug 28, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@pgmpablo157321 force-pushed the standalone_score_compute branch from 8bd36f5 to f5cffb2 on August 29, 2025 19:28
@pgmpablo157321 force-pushed the standalone_score_compute branch from f5cffb2 to b917d71 on August 29, 2025 19:29
@pgmpablo157321 marked this pull request as ready for review on August 29, 2025 19:29
@pgmpablo157321 requested review from a team as code owners on August 29, 2025 19:29
@ShriyaRishab (Contributor) commented Sep 4, 2025

I tested it locally on a few results:

With scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0 --scale
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

Without --scale, but with scaling.json still present in the folder from the previous run -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

After manually deleting scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
ruleset 5.0.0
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 105.52301666666666

But if I don't manually delete scaling.json and run without the --scale flag, it still applies scaling automatically because of the preexisting scaling.json file in the folder. @pgmpablo157321, is this expected behavior, and should the README explain how to deal with scaling.json files in the folder?
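For context on the numbers above: 105.52301666666666 × 1.1538461538461537 = 121.7573269230769, so the scaled runs report the unscaled score times the factor. A minimal sketch of the behavior being described, assuming the factor is stored under a hypothetical "scaling_factor" key in scaling.json; this is not the PR's actual code:

```python
# Sketch of the observed behavior, not the actual implementation.
# The "scaling_factor" key name is an assumption for illustration.
import json
import os

def maybe_apply_scaling(benchmark_folder, raw_minutes):
    scaling_path = os.path.join(benchmark_folder, "scaling.json")
    if os.path.exists(scaling_path):  # presence alone triggers scaling
        with open(scaling_path) as f:
            factor = json.load(f)["scaling_factor"]  # hypothetical key
        print(f"NOTICE: Applying scaling factor {factor} to dir {benchmark_folder}")
        return raw_minutes * factor
    return raw_minutes

# 105.52301666666666 * 1.1538461538461537 == 121.7573269230769,
# matching the unscaled and scaled scores above.
```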

@ShriyaRishab (Contributor)

Testing power scores -

With --has_power

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298
Power Score - Energy (kJ): 6114237.986822284

Without --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298

After deleting scaling.json and with --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
MLPerf training
Folder: training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.11548125
Power Score - Energy (kJ): 6001391.312568853
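For reference, an energy score in kJ like the one above can be computed by integrating power samples over the run. A minimal sketch under the assumption of simple (time_ms, watts) samples; this is not the actual power-log schema used by the checker:

```python
# Sketch only: trapezoidal integration of power samples into energy.
# The (time_ms, watts) sample format is an assumption, not the real schema.
def energy_kj(samples):
    """samples: list of (time_ms, watts) tuples sorted by time."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0) / 1000.0  # ms -> s
    return joules / 1000.0  # J -> kJ

# e.g. a constant 400 W held for one minute:
print(energy_kj([(0, 400.0), (60_000, 400.0)]))  # 24.0 kJ
```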

@ShriyaRishab (Contributor) commented Sep 4, 2025

A few more issues that need to be dealt with -

Trying to compute scores for just 1 or 2 files returns None, although it would help to print the individual score of each file in the folder -

$ ls /temp_results
result_0.txt  result_1.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None

Renaming the files to anything other than result_*.txt also means no scores are computed, although this is expected.

$ ls /temp_results
0.txt  1.txt  2.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None

@ShriyaRishab (Contributor)

@pgmpablo157321 TODO items as discussed in the training WG:

  1. Always delete the scaling.json file so that scores are computed without scaling, unless --scale is passed, in which case scaling.json is created and scores are printed after scaling.
  2. When m < N log files are present, print the score for each file and also add a NOTICE stating that N logs are needed but only m were provided (a sketch follows below).

An additional piece for (2) would be to also print the samples to converge along with the score for each log file, so submitters get a sense of their convergence as well. Is that also something we can add?
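A minimal sketch of what (1) and (2) could look like, using the run_start/run_stop MLLOG events for the per-file time; the required_runs default, the helper names, and the decision to delete scaling.json here are assumptions, not the PR's actual code:

```python
# Sketch only, not the actual implementation.
import glob
import json
import os

def time_to_train_minutes(log_path):
    """Minutes between the run_start and run_stop MLLOG events."""
    times = {}
    with open(log_path) as f:
        for line in f:
            if ":::MLLOG" in line:
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] in ("run_start", "run_stop"):
                    times[entry["key"]] = entry["time_ms"]
    return (times["run_stop"] - times["run_start"]) / 60000.0

def compute_scores(folder, scale=False, required_runs=10):
    scaling_path = os.path.join(folder, "scaling.json")
    if not scale and os.path.exists(scaling_path):
        os.remove(scaling_path)  # (1) never scale unless --scale was passed
    logs = sorted(glob.glob(os.path.join(folder, "result_*.txt")))
    if len(logs) < required_runs:
        print(f"NOTICE: {required_runs} logs are needed "
              f"but only {len(logs)} were provided")
    for log in logs:  # (2) print a per-file score regardless of count
        print(f"{os.path.basename(log)}: {time_to_train_minutes(log):.4f} minutes")
```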

description="Compute the score of a single benchmark",
)
parser.add_argument(
"--benchmark",
Contributor

It should be unnecessary to specify the benchmark from the command line. Rather, this information should be read from the MLLOG submission_benchmark log line. (And additionally, it should be checked that the submission_benchmark log line of every log is the same.)

@ShriyaRishab (Contributor) commented Sep 4, 2025

I don't think we need this standalone script to do everything the compliance checker does. And passing --benchmark isn't much of a hassle either, so it should be easier for submitters to provide that information and for the script to rely on it, rather than failing if, say, the MLLOG lines are not correct.

@matthew-frank (Contributor) commented Sep 4, 2025

This command line flag adds absolutely no value, and it is a hassle to figure out the canonical name of some benchmarks.

In some cases (where all we want to know is the raw score and/or the number of samples to converge) we don't need to know the benchmark name at all.

When we do need to know the benchmark name (to calculate the number of logs for olympic scoring and for rcp scaling), the command line flag is redundant, because the information is in the log files. If you don't want to check that the same benchmark name is in all the files, that's fine. Just get the benchmark name out of the first file then.

Contributor

> it is a hassle to figure out the canonical name of some benchmarks.

That's fair.

> If you don't want to check that the same benchmark name is in all the files, that's fine. Just get the benchmark name out of the first file then.

Yeah, I think this should work. @pgmpablo157321?
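If we go this route, a sketch of pulling the name from the first log's submission_benchmark MLLOG line (the function name and error handling are illustrative, not the final implementation):

```python
# Sketch: read the benchmark name from a log instead of requiring --benchmark.
import json

def benchmark_from_log(log_path):
    with open(log_path) as f:
        for line in f:
            if ":::MLLOG" in line:
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] == "submission_benchmark":
                    return entry["value"]
    raise ValueError(f"no submission_benchmark MLLOG line in {log_path}")
```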

"--has_power", action="store_true", help="Compute power score as well"
)
parser.add_argument(
"--benchmark_folder",
Contributor

I'd recommend taking a list of files rather than a folder name. Then the user could specify the list of files as folder/result*.txt to get all the result files in a folder, but could also specify a single file, or log files and directories named differently than result*.txt, like foo/bar/baz/*.log.
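A sketch of that interface, assuming a hypothetical positional log_files argument in place of --benchmark_folder (the shell does the globbing):

```python
# Hypothetical interface sketch, not the PR's code: accept explicit files
# so users can pass folder/result_*.txt, a single file, or foo/bar/baz/*.log.
import argparse

parser = argparse.ArgumentParser(description="Compute the score of a single benchmark")
parser.add_argument(
    "log_files",  # hypothetical positional argument
    nargs="+",
    help="MLPerf result logs, e.g. folder/result_*.txt or foo/bar/baz/*.log",
)
args = parser.parse_args()
```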



**BENCHMARK:** Name of the benchmark to compute the score such as rgat, llama31_8b, etc.
**SYSTEM_NAME:** The name of the system, it can be set to None.
@matthew-frank (Contributor) commented Sep 4, 2025

Recommend changing

> The name of the system, it can be set to None

to

> Optional system name

Successfully merging this pull request may close these issues:

Can we have a simple script to compute training scores? (#419)