
Conversation

@fynnsu (Collaborator) commented Sep 5, 2025

SUMMARY:

Combines the logic from tests/examples/test_*.py into tests/examples/test_example_scripts.py, which has a parametrized test function that runs them instead.

Notes:

  • Removed README parsing: some tests previously parsed a README file and then ran a code block from that file. However, these code blocks were always just python3 some_script.py, so I replaced them by calling the script directly instead.
  • A handful of tests had extra handling. I preserved their behavior by adding options to the TestCase namedtuple (pre-processing, flags, and post-processing verification) and making the test function more flexible; see the sketch below.
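
A sketch of the resulting structure (field names here are illustrative, not necessarily the exact ones in the final code; run_cli_command is the existing helper in tests/testing_utils.py):

from typing import Callable, NamedTuple, Optional

import pytest

from tests.testing_utils import run_cli_command


class TestCase(NamedTuple):
    # Example script path, relative to the examples/ directory
    path: str
    # Extra CLI flags appended to the command
    flags: tuple = ()
    # Optional setup hook run before the script
    preprocess: Optional[Callable] = None
    # Optional verification hook run after the script completes
    verify: Optional[Callable] = None


@pytest.mark.parametrize(
    "case",
    [
        TestCase("quantizing_moe/mixtral_example.py"),
        pytest.param(
            TestCase("trl_mixin/ex_trl_distillation.py"),
            marks=(
                pytest.mark.skip("disabled until further updates"),
                pytest.mark.multi_gpu,
            ),
        ),
    ],
)
def test_example_scripts(case, tmp_path):
    # Simplified: in the real suite the example is copied/resolved appropriately
    if case.preprocess is not None:
        case.preprocess(tmp_path)
    result = run_cli_command(["python", case.path, *case.flags], cwd=tmp_path)
    assert result.returncode == 0
    if case.verify is not None:
        case.verify(tmp_path)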

TEST PLAN:

All of the example tests essentially boil down to "run the example script and check whether it crashes". A few also have additional checks or preprocessing steps, but that is the main idea. To run the scripts, the tests (both before and after these changes) all call run_cli_command in tests/testing_utils.py, which performs the actual python ... call.

Therefore, to test this change I replaced that function with a dummy that just prints the command and returns success. Then I verified that the printed commands matched before and after the changes (excluding reorderings of the calls).

from pathlib import Path
from subprocess import PIPE, STDOUT, run
from typing import List, Optional, Union


def run_cli_command(cmd: List[str], cwd: Optional[Union[str, Path]] = None):
    # Print the command and the example directory it would have run in
    print()
    print(" ".join(cmd), "in", str(cwd).split("examples/")[-1])

    # Stand-in for subprocess.CompletedProcess that always reports success
    class DummyReturn:
        returncode = 0

    return DummyReturn

    # Original call, disabled for this comparison:
    # return run(cmd, stdout=PIPE, stderr=STDOUT, check=False, encoding="utf-8", cwd=cwd)

In addition, to verify that the special handling works correctly I ran the full examples test suite and confirmed that the tests with special handling still passed.

github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @fynnsu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request streamlines the testing infrastructure for example scripts by centralizing all example tests into a single, highly configurable pytest file. This refactoring aims to reduce code duplication, enhance test maintainability, and simplify the process of adding new example script tests, ultimately making the test suite more efficient and easier to manage.

Highlights

  • Test File Consolidation: Multiple individual test files (e.g., test_awq.py, test_quantization_w4a4_fp4.py) have been removed and their logic consolidated into a single, parametrized test file: test_example_scripts.py. This significantly reduces boilerplate and improves test maintainability.
  • Flexible Test Case Definition: A new TestCase NamedTuple has been introduced in test_example_scripts.py to allow for flexible configuration of each example script test. This includes defining command-line flags, pre-processing steps, and post-execution verification functions for specific test scenarios.
  • Simplified Script Execution: The previous approach of parsing README files to extract and run example commands has been removed. Tests now directly call the example scripts, simplifying the test setup and making it more robust.
  • Refactored Utility Functions: The tests/examples/utils.py file has been refactored. Generic functions for copying and running scripts, and parsing READMEs, have been removed. New, specific verification functions (e.g., verify_2of4_w4a16_output, verify_w4a4_fp4_output) have been added to support the consolidated test cases.

@kylesayrs (Collaborator) left a comment

Great work, just cleanup suggestions from me

"trl_mixin/ex_trl_distillation.py",
marks=(
pytest.mark.skip("disabled until further updates"),
pytest.mark.multi_gpu,
Collaborator

Can you explain how you're handling examples which require multiple GPUs?

Collaborator Author

Yes, so all but two of the original tests had requires_gpu_count(1). The two exceptions are

  1. trl_mixin/ex_trl_distillation.py: This test is also currently skipped. The "handling" is essentially just the pytest.mark.multi_gpu mark, which I copied over from the old implementation. I'm not sure that gets used anywhere, to be honest, but I maintained the existing behavior by adding it here. This is also one of the tests/examples I briefly advocated for removing in yesterday's meeting, because it seems outdated and is skipped anyway. I copied it over here so that this PR stays a refactor rather than a real code change, but I would also be fine with removing it.
  2. quantizing_moe/mixtral_example.py: This test was previously being run twice (once with requires_gpu_count(2) + pytest.mark.multi_gpu and once without). Because it was already completing successfully in the single-GPU runs, I removed the second test.

That being said, this pytest.param(..., marks=...) structure does seem to work well, so it would be easy to mark any individual test with requires_gpu_count(2), which would then cause it to only be run in multi-GPU environments. If you'd like me to restore the second mixtral test, or if we want to add more multi-GPU tests in the future, that should be easy to do.
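
(For reference, adding such a case back would look roughly like the following; the import location of requires_gpu_count is an assumption.)

import pytest

from tests.testing_utils import requires_gpu_count  # assumed import location

# Roughly how a dedicated multi-GPU entry for the mixtral example could be registered
MULTI_GPU_CASES = [
    pytest.param(
        "quantizing_moe/mixtral_example.py",
        marks=(requires_gpu_count(2), pytest.mark.multi_gpu),
    ),
]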

Collaborator

Thanks. All that matters is that we have a system for adding multi-gpu tests moving forward.

I think something like this would be good moving forward.

Suggested change
pytest.mark.multi_gpu,
requires_gpu_count(2),

Collaborator

Just to chime in briefly, these two marks serve different purposes.

pytest.mark.multi_gpu is a custom mark used primarily for test collection. For instance, the example tests are run across multiple jobs, some using multiple GPUs and some using a single GPU, and those jobs use the markers to know which tests to run.

requires_gpu_count(num) is more of a guard: if the tests are run without specifying which to execute, it causes tests that cannot succeed due to insufficient hardware to be skipped, so we don't waste time executing them until they eventually fail. Similarly, I had added a comparable guard based on required VRAM, since that is a key benefit of using multiple GPUs, but I think only one test ended up using it because it was hard to pin down appropriate VRAM requirements for different models/tests/etc.

Collaborator Author

That makes sense. Perhaps we could update requires_gpu_count to automatically add the pytest.mark.multi_gpu marker if num > 1?
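
(A minimal sketch of that idea, assuming requires_gpu_count is built on pytest.mark.skipif and torch.cuda.device_count(); this is not the project's actual implementation.)

import pytest
import torch


def requires_gpu_count(num: int):
    # Skip when fewer GPUs are available than required
    marks = [
        pytest.mark.skipif(
            torch.cuda.device_count() < num,
            reason=f"requires at least {num} GPU(s)",
        )
    ]
    if num > 1:
        # Automatically tag multi-GPU tests for collection/sharding purposes
        marks.append(pytest.mark.multi_gpu)
    return tuple(marks)


# Usage in a parametrized case, e.g.:
# pytest.param("quantizing_moe/mixtral_example.py", marks=requires_gpu_count(2))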

Collaborator Author

Okay, I ended up just adding the mark.multi_gpu manually.

I also restored the second mixtral test I had previously removed, since it is the only test currently running under mark.multi_gpu.

Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@dsikka (Collaborator) commented Sep 5, 2025

Please don’t merge without @dbarbuzzi’s approval

@dsikka dsikka requested a review from dbarbuzzi September 5, 2025 18:20
kylesayrs previously approved these changes Sep 5, 2025
"trl_mixin/ex_trl_distillation.py",
marks=(
pytest.mark.skip("disabled until further updates"),
pytest.mark.multi_gpu,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. All that matters is that we have a system for adding multi-gpu tests moving foward.

I think something like this would be good moving forward.

Suggested change
pytest.mark.multi_gpu,
requires_gpu_count(2),

dbarbuzzi previously approved these changes Sep 9, 2025
@dbarbuzzi (Collaborator) left a comment

I think this is generally fine as long as the overall coverage is kept and the marks are preserved (e.g. pytest.mark.multi_gpu) as the marks are critical in how the test cases are executed to balance load across limited resources.

If interested, one refactor I had on my to-do list was to update how the script execution output was handled. Currently, it is inserted as part of the failure message, which was originally done so that it would only be written to output if the test failed. However, even if we are to keep that condition, I wanted to move it out of the error message; instead, just logging it separately (potentially only if the test failed if we wanted to keep similar behavior). I only point this out in case we wanted to try and incorporate that into this refactor or make a new task for future work.

One quick note, though, that I’ll leave to stakeholders to consider:

Removed README parsing […]

Historically, one of the intents of the examples testing has been to test code blocks from the README as they were considered part of the example. We wanted to ensure that end users who were utilizing the rendered README (not just raw Markdown) and potentially copying code from it were not going to hit errors in the copied code.

If we are not as concerned with this aspect any longer, it’s fine to remove these tests. However, if we do want to cover that angle, they shouldn’t be removed, though perhaps they can be modified with a different approach to validate the command without actually executing it (though it should still be extracted from rendered Markdown if this is the case, not the raw source).

@fynnsu (Collaborator, Author) commented Sep 9, 2025

Historically, one of the intents of the examples testing has been to test code blocks from the README as they were considered part of the example. We wanted to ensure that end users who were utilizing the rendered README (not just raw Markdown) and potentially copying code from it were not going to hit errors in the copied code.

If we are not as concerned with this aspect any longer, it’s fine to remove these tests. However, if we do want to cover that angle, they shouldn’t be removed, though perhaps they can be modified with a different approach to validate the command without actually executing it (though it should still be extracted from rendered Markdown if this is the case, not the raw source).

I agree with the idea of testing the README code blocks; however, all of the existing README parsing tests were just running the "Quickstart" code blocks, which just call python some_script.py.

You can also see this because a lot of the README tests have

assert command.startswith("python")

I think it could be valuable to test the other walkthrough code blocks in some way, and we could use README parsing for that. But the current system basically just tests that the script name in the README matches, while adding a lot of complexity. If we think that's worth confirming, I can add a test that just checks it.

@fynnsu (Collaborator, Author) commented Sep 9, 2025

If interested, one refactor I had on my to-do list was to update how the script execution output was handled. Currently, it is inserted as part of the failure message, which was originally done so that it would only be written to output if the test failed. However, even if we are to keep that condition, I wanted to move it out of the error message; instead, just logging it separately (potentially only if the test failed if we wanted to keep similar behavior). I only point this out in case we wanted to try and incorporate that into this refactor or make a new task for future work.

Yeah, I noticed the script outputs are currently suppressed while running. I think it would be useful to record the logs for all of the tests. Maybe writing them to log files and uploading those as artifacts would be one way to do this without cluttering the console output.

I think this could be a separate PR, but I can help implement it if you'd like.
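
(For illustration, one way the "log separately, only on failure" idea could look; the result argument is assumed to be a subprocess.CompletedProcess and the function name is made up.)

import logging
from subprocess import CompletedProcess

logger = logging.getLogger(__name__)


def assert_script_succeeded(result: CompletedProcess) -> None:
    if result.returncode != 0:
        # Emit the captured output via logging instead of packing it into the message
        logger.error("Example script output:\n%s", result.stdout)
    assert result.returncode == 0, f"script exited with code {result.returncode}"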

@dbarbuzzi (Collaborator)

I agree with the idea of testing the README code blocks; however, all of the existing README parsing tests were just running the "Quickstart" code blocks, which just call python some_script.py.

As trivial as it sounds, that is actually part of the original intent (because there were occasionally things like typos in file names, or a file gets renamed without the command being updated, etc.). Therefore, a lighter-weight validation is still warranted if we are concerned with README contents (maybe a check that the command basically consists of two tokens: python/python3 and a filename that matches one of the scripts in the current example's folder).
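
(Something like the following sketch, perhaps; the function name and how the command is extracted from the rendered README are assumptions.)

from pathlib import Path


def verify_quickstart_command(command: str, example_dir: Path) -> None:
    # Expect exactly "python <script>" or "python3 <script>"
    tokens = command.split()
    assert len(tokens) == 2, f"expected 'python <script>', got: {command!r}"
    interpreter, script = tokens
    assert interpreter in ("python", "python3")
    # The referenced script must actually exist in this example's folder
    assert (example_dir / script).is_file(), f"{script} not found in {example_dir}"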

@fynnsu (Collaborator, Author) commented Sep 9, 2025

Okay, that's fair. I think I would still rather test that explicitly in a separate test and keep the script-running tests clean. I will add the test to this PR.

@fynnsu (Collaborator, Author) commented Sep 10, 2025

@dbarbuzzi Please take a look when you get a chance. I added explicit README tests to confirm that the script exists and is called using python or python3. I think I have also resolved the multi_gpu marking now.

I was initially concerned that the README tests would interfere with sharding (if all 6 lightweight README tests ended up on one GPU and displaced other jobs), but

  1. I tested this manually (e.g. pytest tests/examples/ --collect-only --shard-id 0 --num-shards 2) and the tests were split evenly
  2. I looked at the source code for pytest-shard, and I think the distribution of tests onto shards doesn't depend on other tests but rather just on the hash of each test's name (see the sketch below)
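
(A simplified illustration of name-hash sharding, not pytest-shard's exact code: each test's shard depends only on a stable hash of its own node ID, so adding or removing other tests does not move it.)

import hashlib


def belongs_to_shard(nodeid: str, shard_id: int, num_shards: int) -> bool:
    # A stable hash of the test's node ID decides its shard, independent of other tests
    digest = hashlib.sha256(nodeid.encode()).hexdigest()
    return int(digest, 16) % num_shards == shard_id


# e.g. belongs_to_shard("tests/examples/test_example_scripts.py::test_readme[awq]", 0, 2)
# (hypothetical node ID)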

@fynnsu added the ready (When a PR is ready for review) label Sep 17, 2025