
Conversation

@fynnsu (Collaborator) commented Sep 5, 2025

SUMMARY:

Combines the logic from tests/examples/test_*.py into tests/examples/test_example_scripts.py, which has a parametrized test function that runs them instead.

Notes:

  • Removed README parsing: some tests previously parsed a README file and then ran a code block from that file. However, these code blocks were always just python3 some_script.py, so I replaced them by calling the script directly instead.
  • A handful of tests had extra handling. I preserved their behavior by adding options to the TestCase namedtuple (pre-processing, flags, and post-processing verification) and making the test function more flexible; see the sketch below.
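
A sketch of the resulting structure (field names here are illustrative, not necessarily the exact ones in the final code; run_cli_command is the existing helper in tests/testing_utils.py):

from typing import Callable, NamedTuple, Optional

import pytest

from tests.testing_utils import run_cli_command


class TestCase(NamedTuple):
    # Example script path, relative to the examples/ directory
    path: str
    # Extra CLI flags appended to the command
    flags: tuple = ()
    # Optional setup hook run before the script
    preprocess: Optional[Callable] = None
    # Optional verification hook run after the script completes
    verify: Optional[Callable] = None


@pytest.mark.parametrize(
    "case",
    [
        TestCase("quantizing_moe/mixtral_example.py"),
        pytest.param(
            TestCase("trl_mixin/ex_trl_distillation.py"),
            marks=(
                pytest.mark.skip("disabled until further updates"),
                pytest.mark.multi_gpu,
            ),
        ),
    ],
)
def test_example_scripts(case, tmp_path):
    # Simplified: in the real suite the example is copied/resolved appropriately
    if case.preprocess is not None:
        case.preprocess(tmp_path)
    result = run_cli_command(["python", case.path, *case.flags], cwd=tmp_path)
    assert result.returncode == 0
    if case.verify is not None:
        case.verify(tmp_path)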

TEST PLAN:

All of the example tests essentially boil down to "run the example script and check whether it crashes". A few also have additional checks or preprocessing steps, but that is the main idea. To run the scripts, the tests (both before and after these changes) all call run_cli_command in tests/testing_utils.py, which performs the actual python ... call.

Therefore, to test this change I replaced that function with a dummy that just prints the command and returns success. Then I verified that the printed commands matched before and after the changes (excluding reorderings of the calls).

from pathlib import Path
from subprocess import PIPE, STDOUT, run
from typing import List, Optional, Union


def run_cli_command(cmd: List[str], cwd: Optional[Union[str, Path]] = None):
    # Print the command and the example directory it would have run in
    print()
    print(" ".join(cmd), "in", str(cwd).split("examples/")[-1])

    # Stand-in for subprocess.CompletedProcess that always reports success
    class DummyReturn:
        returncode = 0

    return DummyReturn

    # Original call, disabled for this comparison:
    # return run(cmd, stdout=PIPE, stderr=STDOUT, check=False, encoding="utf-8", cwd=cwd)

In addition, to verify that the special handling works correctly I ran the full examples test suite and confirmed that the tests with special handling still passed.

github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @fynnsu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the testing infrastructure for example scripts by centralizing all example tests into a single, highly configurable pytest file. This refactoring aims to reduce code duplication, enhance test maintainability, and simplify the process of adding new example script tests, ultimately making the test suite more efficient and easier to manage.

Highlights

  • Test File Consolidation: Multiple individual test files (e.g., test_awq.py, test_quantization_w4a4_fp4.py) have been removed and their logic consolidated into a single, parametrized test file: test_example_scripts.py. This significantly reduces boilerplate and improves test maintainability.
  • Flexible Test Case Definition: A new TestCase NamedTuple has been introduced in test_example_scripts.py to allow for flexible configuration of each example script test. This includes defining command-line flags, pre-processing steps, and post-execution verification functions for specific test scenarios.
  • Simplified Script Execution: The previous approach of parsing README files to extract and run example commands has been removed. Tests now directly call the example scripts, simplifying the test setup and making it more robust.
  • Refactored Utility Functions: The tests/examples/utils.py file has been refactored. Generic functions for copying and running scripts, and parsing READMEs, have been removed. New, specific verification functions (e.g., verify_2of4_w4a16_output, verify_w4a4_fp4_output) have been added to support the consolidated test cases.

@kylesayrs (Collaborator) left a comment

Great work, just cleanup suggestions from me

"trl_mixin/ex_trl_distillation.py",
marks=(
pytest.mark.skip("disabled until further updates"),
pytest.mark.multi_gpu,
Collaborator

Can you explain how you're handling examples which require multiple GPUs?

Collaborator Author

Yes, so all but two of the original tests had requires_gpu_count(1). The two exceptions are

  1. trl_mixin/ex_trl_distillation.py: This test is also currently skipped. The "handling" is essentially just the pytest.mark.multi_gpu mark, which I copied over from the old implementation. I'm not sure that gets used anywhere, to be honest, but I maintained the existing behavior by adding it here. This is also one of the tests/examples I briefly advocated for removing in yesterday's meeting, because it seems outdated and is skipped anyway. I copied it over here so that this PR stays a refactor rather than a real code change, but I would also be fine with removing it.
  2. quantizing_moe/mixtral_example.py: This test was previously being run twice (once with requires_gpu_count(2) + pytest.mark.multi_gpu and once without). Because it was already completing successfully in the single-GPU runs, I removed the second test.

That being said, this pytest.param(..., marks=...) structure does seem to work well, so it would be easy to mark any individual test with requires_gpu_count(2), which would then cause it to only be run in multi-GPU environments. If you'd like me to restore the second mixtral test, or if we want to add more multi-GPU tests in the future, that should be easy to do.
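
(For reference, adding such a case back would look roughly like the following; the import location of requires_gpu_count is an assumption.)

import pytest

from tests.testing_utils import requires_gpu_count  # assumed import location

# Roughly how a dedicated multi-GPU entry for the mixtral example could be registered
MULTI_GPU_CASES = [
    pytest.param(
        "quantizing_moe/mixtral_example.py",
        marks=(requires_gpu_count(2), pytest.mark.multi_gpu),
    ),
]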

Collaborator

Thanks. All that matters is that we have a system for adding multi-gpu tests moving forward.

I think something like this would be good moving forward.

Suggested change
pytest.mark.multi_gpu,
requires_gpu_count(2),

Collaborator

Just to chime in briefly, these two marks serve different purposes.

pytest.mark.multi_gpu is a custom mark used primarily for test collection. For instance, the example tests are run across multiple jobs, some using multiple GPUs and some using a single GPU, and those jobs use the markers to know which tests to run.

requires_gpu_count(num) is more of a guard: if the tests are run without specifying which to execute, it causes tests that cannot succeed due to insufficient hardware to be skipped, so we don't waste time executing them until they eventually fail. Similarly, I had added a comparable guard based on required VRAM, since that is a key benefit of using multiple GPUs, but I think only one test ended up using it because it was hard to pin down appropriate VRAM requirements for different models/tests/etc.

Collaborator Author

That makes sense. Perhaps we could update requires_gpu_count to automatically add the pytest.mark.multi_gpu marker if num > 1?
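
(A minimal sketch of that idea, assuming requires_gpu_count is built on pytest.mark.skipif and torch.cuda.device_count(); this is not the project's actual implementation.)

import pytest
import torch


def requires_gpu_count(num: int):
    # Skip when fewer GPUs are available than required
    marks = [
        pytest.mark.skipif(
            torch.cuda.device_count() < num,
            reason=f"requires at least {num} GPU(s)",
        )
    ]
    if num > 1:
        # Automatically tag multi-GPU tests for collection/sharding purposes
        marks.append(pytest.mark.multi_gpu)
    return tuple(marks)


# Usage in a parametrized case, e.g.:
# pytest.param("quantizing_moe/mixtral_example.py", marks=requires_gpu_count(2))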

Collaborator Author

Okay, I ended up just adding the mark.multi_gpu manually.

I also restored the second mixtral test I had previously removed, since it is the only test currently running under mark.multi_gpu.

Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@dsikka (Collaborator) commented Sep 5, 2025

Please don’t merge without @dbarbuzzi’s approval

@dsikka dsikka requested a review from dbarbuzzi September 5, 2025 18:20
kylesayrs previously approved these changes Sep 5, 2025
"trl_mixin/ex_trl_distillation.py",
marks=(
pytest.mark.skip("disabled until further updates"),
pytest.mark.multi_gpu,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. All that matters is that we have a system for adding multi-gpu tests moving foward.

I think something like this would be good moving forward.

Suggested change
pytest.mark.multi_gpu,
requires_gpu_count(2),

dbarbuzzi previously approved these changes Sep 9, 2025
@dbarbuzzi (Collaborator) left a comment

I think this is generally fine as long as the overall coverage is kept and the marks are preserved (e.g. pytest.mark.multi_gpu) as the marks are critical in how the test cases are executed to balance load across limited resources.

If interested, one refactor I had on my to-do list was to update how the script execution output was handled. Currently, it is inserted as part of the failure message, which was originally done so that it would only be written to output if the test failed. However, even if we are to keep that condition, I wanted to move it out of the error message; instead, just logging it separately (potentially only if the test failed if we wanted to keep similar behavior). I only point this out in case we wanted to try and incorporate that into this refactor or make a new task for future work.

One quick note, though, that I’ll leave to stakeholders to consider:

Removed README parsing […]

Historically, one of the intents of the examples testing has been to test code blocks from the README as they were considered part of the example. We wanted to ensure that end users who were utilizing the rendered README (not just raw Markdown) and potentially copying code from it were not going to hit errors in the copied code.

If we are not as concerned with this aspect any longer, it’s fine to remove these tests. However, if we do want to cover that angle, they shouldn’t be removed, though perhaps they can be modified with a different approach to validate the command without actually executing it (though it should still be extracted from rendered Markdown if this is the case, not the raw source).

@fynnsu (Collaborator, Author) commented Sep 9, 2025

Historically, one of the intents of the examples testing has been to test code blocks from the README as they were considered part of the example. We wanted to ensure that end users who were utilizing the rendered README (not just raw Markdown) and potentially copying code from it were not going to hit errors in the copied code.

If we are not as concerned with this aspect any longer, it’s fine to remove these tests. However, if we do want to cover that angle, they shouldn’t be removed, though perhaps they can be modified with a different approach to validate the command without actually executing it (though it should still be extracted from rendered Markdown if this is the case, not the raw source).

I agree with the idea of testing the README code blocks; however, all of the existing README parsing tests were just running the "Quickstart" code blocks, which just call python some_script.py.

You can also see this because a lot of the README tests have

assert command.startswith("python")

I think it could be valuable to test the other walkthrough code blocks in some way, and we could use README parsing for that. But the current system basically just tests that the script name in the README matches, while adding a lot of complexity. If we think that's worth confirming, I can add a test that just checks it.

@fynnsu (Collaborator, Author) commented Sep 9, 2025

If interested, one refactor I had on my to-do list was to update how the script execution output was handled. Currently, it is inserted as part of the failure message, which was originally done so that it would only be written to output if the test failed. However, even if we are to keep that condition, I wanted to move it out of the error message; instead, just logging it separately (potentially only if the test failed if we wanted to keep similar behavior). I only point this out in case we wanted to try and incorporate that into this refactor or make a new task for future work.

Yeah, I noticed the script outputs are currently suppressed while running. I think it would be useful to record the logs for all of the tests. Maybe writing them to log files and uploading those as artifacts would be one way to do this without cluttering the console output.

I think this could be a separate PR, but I can help implement it if you'd like.
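
(For illustration, one way the "log separately, only on failure" idea could look; the result argument is assumed to be a subprocess.CompletedProcess and the function name is made up.)

import logging
from subprocess import CompletedProcess

logger = logging.getLogger(__name__)


def assert_script_succeeded(result: CompletedProcess) -> None:
    if result.returncode != 0:
        # Emit the captured output via logging instead of packing it into the message
        logger.error("Example script output:\n%s", result.stdout)
    assert result.returncode == 0, f"script exited with code {result.returncode}"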

@dbarbuzzi (Collaborator)

I agree with the idea of testing the README code blocks; however, all of the existing README parsing tests were just running the "Quickstart" code blocks, which just call python some_script.py.

As trivial as it sounds, that is actually part of the original intent (because there were occasionally things like typos in file names, or a file gets renamed without the command being updated, etc.). Therefore, a lighter-weight validation is still warranted if we are concerned with README contents (maybe a check that the command basically consists of two tokens: python/python3 and a filename that matches one of the scripts in the current example's folder).
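
(Something like the following sketch, perhaps; the function name and how the command is extracted from the rendered README are assumptions.)

from pathlib import Path


def verify_quickstart_command(command: str, example_dir: Path) -> None:
    # Expect exactly "python <script>" or "python3 <script>"
    tokens = command.split()
    assert len(tokens) == 2, f"expected 'python <script>', got: {command!r}"
    interpreter, script = tokens
    assert interpreter in ("python", "python3")
    # The referenced script must actually exist in this example's folder
    assert (example_dir / script).is_file(), f"{script} not found in {example_dir}"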

@fynnsu (Collaborator, Author) commented Sep 9, 2025

Okay, that's fair. I think I would still rather test that explicitly in a separate test and keep the script-running tests clean. I will add the test to this PR.

@fynnsu (Collaborator, Author) commented Sep 10, 2025

@dbarbuzzi Please take a look when you get a chance. I added explicit README tests to confirm that the script exists and is called using python or python3. I think I have also resolved the multi_gpu marking now.

I was initially concerned that the README tests would interfere with sharding (if all 6 lightweight README tests ended up on one GPU and displaced other jobs), but

  1. I tested this manually (e.g. pytest tests/examples/ --collect-only --shard-id 0 --num-shards 2) and the tests were split evenly
  2. I looked at the source code for pytest-shard, and I think the distribution of tests onto shards doesn't depend on other tests but rather just on the hash of each test's name (see the sketch below)
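
(A simplified illustration of name-hash sharding, not pytest-shard's exact code: each test's shard depends only on a stable hash of its own node ID, so adding or removing other tests does not move it.)

import hashlib


def belongs_to_shard(nodeid: str, shard_id: int, num_shards: int) -> bool:
    # A stable hash of the test's node ID decides its shard, independent of other tests
    digest = hashlib.sha256(nodeid.encode()).hexdigest()
    return int(digest, 16) % num_shards == shard_id


# e.g. belongs_to_shard("tests/examples/test_example_scripts.py::test_readme[awq]", 0, 2)
# (hypothetical node ID)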

@fynnsu added the ready (When a PR is ready for review) label Sep 17, 2025