
Conversation

pgmpablo157321 (Contributor)

Fix #419


github-actions bot commented Aug 28, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@pgmpablo157321 force-pushed the standalone_score_compute branch from 8bd36f5 to f5cffb2 on August 29, 2025 19:28
@pgmpablo157321 force-pushed the standalone_score_compute branch from f5cffb2 to b917d71 on August 29, 2025 19:29
@pgmpablo157321 marked this pull request as ready for review on August 29, 2025 19:29
@pgmpablo157321 requested review from a team as code owners on August 29, 2025 19:29
@ShriyaRishab (Contributor) commented Sep 4, 2025

I tested it locally on a few results:

With scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0 --scale
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

Without --scale, but with scaling.json still present in the folder from the previous run -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.1538461538461537 to dir /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 121.7573269230769

After manually deleting scaling.json -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system tyche_ngpu512_ngc25.04_nemo --benchmark_folder /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b --usage training --ruleset 5.0.0
ruleset 5.0.0
MLPerf training
Folder: /training_results_v5.0/NVIDIA/results/tyche_ngpu512_ngc25.04_nemo/llama31_405b
Version: 5.0.0
System: tyche_ngpu512_ngc25.04_nemo
Benchmark: llama31_405b
Score - Time to Train (minutes): 105.52301666666666

But if I don't manually delete scaling.json and run without the --scale flag, it still applies scaling automatically because of the preexisting scaling.json file in the folder. @pgmpablo157321, is this expected behavior, and should the README explain how to deal with scaling.json files in the folder?
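For context on the numbers above: 105.52301666666666 × 1.1538461538461537 = 121.7573269230769, so the scaled runs report the unscaled score times the factor. A minimal sketch of the behavior being described, assuming the factor is stored under a hypothetical "scaling_factor" key in scaling.json; this is not the PR's actual code:

```python
# Sketch of the observed behavior, not the actual implementation.
# The "scaling_factor" key name is an assumption for illustration.
import json
import os

def maybe_apply_scaling(benchmark_folder, raw_minutes):
    scaling_path = os.path.join(benchmark_folder, "scaling.json")
    if os.path.exists(scaling_path):  # presence alone triggers scaling
        with open(scaling_path) as f:
            factor = json.load(f)["scaling_factor"]  # hypothetical key
        print(f"NOTICE: Applying scaling factor {factor} to dir {benchmark_folder}")
        return raw_minutes * factor
    return raw_minutes

# 105.52301666666666 * 1.1538461538461537 == 121.7573269230769,
# matching the unscaled and scaled scores above.
```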

@ShriyaRishab (Contributor)

Testing power scores -

With --has_power

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298
Power Score - Energy (kJ): 6114237.986822284

Without --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0
NOTICE: Applying scaling factor 1.0188034188034187 to dir /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
MLPerf training
Folder: /training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.324490299145298

After deleting scaling.json and with --has_power -

$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama2_70b_lora  --system xyz --benchmark_folder training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora --usage training --ruleset 5.0.0 --has_power
MLPerf training
Folder: training_results_v5.0/Lenovo/results/SR780aV3-8xB200_SXM_180GB/llama2_70b_lora
Version: 5.0.0
System: xyz
Benchmark: llama2_70b_lora
Score - Time to Train (minutes): 11.11548125
Power Score - Energy (kJ): 6001391.312568853
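For reference, an energy score in kJ like the one above can be computed by integrating power samples over the run. A minimal sketch under the assumption of simple (time_ms, watts) samples; this is not the actual power-log schema used by the checker:

```python
# Sketch only: trapezoidal integration of power samples into energy.
# The (time_ms, watts) sample format is an assumption, not the real schema.
def energy_kj(samples):
    """samples: list of (time_ms, watts) tuples sorted by time."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0) / 1000.0  # ms -> s
    return joules / 1000.0  # J -> kJ

# e.g. a constant 400 W held for one minute:
print(energy_kj([(0, 400.0), (60_000, 400.0)]))  # 24.0 kJ
```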

@ShriyaRishab (Contributor) commented Sep 4, 2025

A few more issues that need to be dealt with -

Trying to compute scores for just 1 or 2 files returns None, although it would help to print the individual score of each file in the folder -

$ ls /temp_results
result_0.txt  result_1.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None

Renaming the files to anything other than result_*.txt also means no scores are computed, although this is expected.

$ ls /temp_results
0.txt  1.txt  2.txt
$ python3 -m mlperf_logging.result_summarizer.compute_score --benchmark llama31_405b  --system xyz --benchmark_folder /temp_results --usage training --ruleset 5.0.0
MLPerf training
Folder: /temp_results
Version: 5.0.0
System: xyz
Benchmark: llama31_405b
Score - Time to Train (minutes): None

@ShriyaRishab (Contributor)

@pgmpablo157321 TODO items as discussed in the training WG:

  1. Always delete the scaling.json file so that scores are computed without scaling, unless --scale is passed, in which case scaling.json is created and scores are printed after scaling.
  2. When m < N log files are present, print the score for each file and also add a NOTICE stating that N logs are needed but only m were provided (a sketch follows below).

An additional piece for (2) would be to also print the samples to converge along with the score for each log file, so submitters get a sense of their convergence as well. Is that also something we can add?
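A minimal sketch of what (1) and (2) could look like, using the run_start/run_stop MLLOG events for the per-file time; the required_runs default, the helper names, and the decision to delete scaling.json here are assumptions, not the PR's actual code:

```python
# Sketch only, not the actual implementation.
import glob
import json
import os

def time_to_train_minutes(log_path):
    """Minutes between the run_start and run_stop MLLOG events."""
    times = {}
    with open(log_path) as f:
        for line in f:
            if ":::MLLOG" in line:
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] in ("run_start", "run_stop"):
                    times[entry["key"]] = entry["time_ms"]
    return (times["run_stop"] - times["run_start"]) / 60000.0

def compute_scores(folder, scale=False, required_runs=10):
    scaling_path = os.path.join(folder, "scaling.json")
    if not scale and os.path.exists(scaling_path):
        os.remove(scaling_path)  # (1) never scale unless --scale was passed
    logs = sorted(glob.glob(os.path.join(folder, "result_*.txt")))
    if len(logs) < required_runs:
        print(f"NOTICE: {required_runs} logs are needed "
              f"but only {len(logs)} were provided")
    for log in logs:  # (2) print a per-file score regardless of count
        print(f"{os.path.basename(log)}: {time_to_train_minutes(log):.4f} minutes")
```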

description="Compute the score of a single benchmark",
)
parser.add_argument(
"--benchmark",
Contributor

It should be unnecessary to specify the benchmark from the command line. Rather, this information should be read from the MLLOG submission_benchmark log line. (And additionally, it should be checked that the submission_benchmark log line of every log is the same.)

@ShriyaRishab (Contributor) commented Sep 4, 2025

I don't think we need this standalone script to do everything the compliance checker does. And passing --benchmark isn't much of a hassle either, so it should be easier for submitters to provide that information and for the script to rely on it, rather than failing if, say, the MLLOG lines are not correct.

@matthew-frank (Contributor) commented Sep 4, 2025

This command line flag adds absolutely no value, and it is a hassle to figure out the canonical name of some benchmarks.

In some cases (where all we want to know is the raw score and/or the number of samples to converge) we don't need to know the benchmark name at all.

When we do need to know the benchmark name (to calculate the number of logs for olympic scoring and for rcp scaling), the command line flag is redundant, because the information is in the log files. If you don't want to check that the same benchmark name is in all the files, that's fine. Just get the benchmark name out of the first file then.

Contributor

> it is a hassle to figure out the canonical name of some benchmarks.

That's fair.

> If you don't want to check that the same benchmark name is in all the files, that's fine. Just get the benchmark name out of the first file then.

Yeah, I think this should work. @pgmpablo157321?
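If we go this route, a sketch of pulling the name from the first log's submission_benchmark MLLOG line (the function name and error handling are illustrative, not the final implementation):

```python
# Sketch: read the benchmark name from a log instead of requiring --benchmark.
import json

def benchmark_from_log(log_path):
    with open(log_path) as f:
        for line in f:
            if ":::MLLOG" in line:
                entry = json.loads(line.split(":::MLLOG", 1)[1])
                if entry["key"] == "submission_benchmark":
                    return entry["value"]
    raise ValueError(f"no submission_benchmark MLLOG line in {log_path}")
```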

"--has_power", action="store_true", help="Compute power score as well"
)
parser.add_argument(
"--benchmark_folder",
Contributor

I'd recommend taking a list of files rather than a folder name. Then the user could specify the list of files as folder/result*.txt to get all the result files in a folder, but could also specify a single file, or log files and directories named differently than result*.txt, like foo/bar/baz/*.log.
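A sketch of that interface, assuming a hypothetical positional log_files argument in place of --benchmark_folder (the shell does the globbing):

```python
# Hypothetical interface sketch, not the PR's code: accept explicit files
# so users can pass folder/result_*.txt, a single file, or foo/bar/baz/*.log.
import argparse

parser = argparse.ArgumentParser(description="Compute the score of a single benchmark")
parser.add_argument(
    "log_files",  # hypothetical positional argument
    nargs="+",
    help="MLPerf result logs, e.g. folder/result_*.txt or foo/bar/baz/*.log",
)
args = parser.parse_args()
```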



**BENCHMARK:** Name of the benchmark to compute the score such as rgat, llama31_8b, etc.
**SYSTEM_NAME:** The name of the system, it can be set to None.
@matthew-frank (Contributor) commented Sep 4, 2025

Recommend changing

> The name of the system, it can be set to None

to

> Optional system name

Successfully merging this pull request may close these issues:

Can we have a simple script to compute training scores? (#419)