Consolidate example script tests into single parametrized test #1801
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @fynnsu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request streamlines the testing infrastructure for example scripts by centralizing all example tests into a single, highly configurable pytest file. This refactoring aims to reduce code duplication, enhance test maintainability, and simplify the process of adding new example script tests, ultimately making the test suite more efficient and easier to manage.
Highlights
- Test File Consolidation: Multiple individual test files (e.g., test_awq.py, test_quantization_w4a4_fp4.py) have been removed and their logic consolidated into a single, parametrized test file: test_example_scripts.py. This significantly reduces boilerplate and improves test maintainability.
- Flexible Test Case Definition: A new `TestCase` NamedTuple has been introduced in test_example_scripts.py to allow for flexible configuration of each example script test. This includes defining command-line flags, pre-processing steps, and post-execution verification functions for specific test scenarios (see the sketch after this list).
- Simplified Script Execution: The previous approach of parsing README files to extract and run example commands has been removed. Tests now directly call the example scripts, simplifying the test setup and making it more robust.
- Refactored Utility Functions: The `tests/examples/utils.py` file has been refactored. Generic functions for copying and running scripts, and parsing READMEs, have been removed. New, specific verification functions (e.g., `verify_2of4_w4a16_output`, `verify_w4a4_fp4_output`) have been added to support the consolidated test cases.
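As a rough illustration of the pattern described above, here is a hedged sketch; the field names, defaults, example paths, and the use of `subprocess` as a stand-in for the repo's `run_cli_command` helper are assumptions, not the actual contents of `test_example_scripts.py`.

```python
# Hypothetical sketch of a consolidated, parametrized example-script test.
import subprocess
from typing import Callable, NamedTuple, Optional

import pytest


class TestCase(NamedTuple):
    path: str                                        # example script to run
    flags: tuple = ()                                # extra CLI flags for the script
    preprocess: Optional[Callable[[], None]] = None  # optional setup step
    verify: Optional[Callable[[], None]] = None      # optional post-run verification


TEST_CASES = [
    pytest.param(TestCase("quantization_w4a16/llama3_example.py"), id="w4a16"),
    pytest.param(
        TestCase("trl_mixin/ex_trl_distillation.py"),
        marks=(
            pytest.mark.skip("disabled until further updates"),
            pytest.mark.multi_gpu,
        ),
        id="trl_distillation",
    ),
]


@pytest.mark.parametrize("case", TEST_CASES)
def test_example_script(case: TestCase):
    if case.preprocess is not None:
        case.preprocess()
    # the real suite calls a shared helper (run_cli_command); subprocess is a stand-in here
    result = subprocess.run(
        ["python", case.path, *case.flags], capture_output=True, text=True
    )
    assert result.returncode == 0, result.stderr
    if case.verify is not None:
        case.verify()
```

With this layout, adding a new example test amounts to appending one `pytest.param(TestCase(...))` entry rather than writing a new test file.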
Great work, just cleanup suggestions from me
"trl_mixin/ex_trl_distillation.py", | ||
marks=( | ||
pytest.mark.skip("disabled until further updates"), | ||
pytest.mark.multi_gpu, |
Can you explain how you're handling examples which require multiple GPUs?
Yes, so all but two of the original tests had `requires_gpu_count(1)`. The two exceptions are:

- `trl_mixin/ex_trl_distillation.py`: This test is also currently skipped. The "handling" is essentially just the `pytest.mark.multi_gpu`, which I copied over from the old implementation. I'm not sure if that gets used anywhere to be honest, but I maintained the existing behavior by adding it here. This is also one of the tests/examples I briefly advocated for removing in yesterday's meeting because it seems outdated and is skipped anyway. I copied it over here so that this PR is a refactor, not a real code change, but I would also be good with removing it.
- `quantizing_moe/mixtral_example.py`: This test was actually being called twice previously (once with `requires_gpu_count(2)` + `pytest.mark.multi_gpu` and once without). Because it was already completing successfully on the single-GPU runs, I removed the second test.

That being said, this `pytest.param(..., marks=...)` structure does seem to work well, so it would be easy to mark any individual test with `requires_gpu_count(2)`, which would then cause it to only be run in multi-GPU environments. If you'd like me to restore the second mixtral test, or if we want to add multi-GPU tests in the future, that should be easy to do.
Thanks. All that matters is that we have a system for adding multi-GPU tests moving forward. I think something like this would be good:

Suggested change:
    pytest.mark.multi_gpu,
    requires_gpu_count(2),
Just to chime in briefly, these two marks serve different purposes.
`pytest.mark.multi_gpu` is a custom mark that is used primarily for test collection. For example, the example tests are run across multiple jobs, some using multiple GPUs and some using a single GPU, and the jobs use the markers to know which tests to run.

`requires_gpu_count(num)` is more of a guard: if the tests are run without specifying which ones to execute, it causes tests that cannot succeed due to insufficient hardware to be skipped, so time isn't wasted running them until they eventually fail. Similarly, I had added a more appropriate guard based on required VRAM, since that is a key benefit of using multiple GPUs, but I think only one test ended up using it because it was hard to pin down appropriate VRAM requirements for different models/tests/etc.
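To make the distinction concrete, here is a hedged sketch of the two mechanisms; the marker-selection commands and the `requires_gpu_count` body below are illustrative assumptions, not necessarily how the repo implements them.

```python
# Illustrative only: a collection marker vs. a hardware guard.
#
# 1) Collection: CI jobs select tests by marker, e.g.
#       pytest tests/examples -m "multi_gpu"        # multi-GPU job
#       pytest tests/examples -m "not multi_gpu"    # single-GPU job
#
# 2) Guard: skip tests the current host cannot run, so an unfiltered run
#    doesn't spend time on tests that are bound to fail.
import pytest
import torch


def requires_gpu_count(num: int):
    """Skip the test unless at least `num` GPUs are visible (sketch)."""
    available = torch.cuda.device_count() if torch.cuda.is_available() else 0
    return pytest.mark.skipif(
        available < num,
        reason=f"requires at least {num} GPUs, found {available}",
    )
```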
That makes sense. Perhaps we could update `requires_gpu_count` to automatically add the `pytest.mark.multi_gpu` marker if `num > 1`?
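A hedged sketch of that idea (purely hypothetical; returning a list of marks works for `pytest.param(..., marks=...)` usage, though decorator-style callers would need adjusting):

```python
# Hypothetical variant: bundle the multi_gpu collection marker with the
# hardware guard whenever more than one GPU is required.
import pytest
import torch


def requires_gpu_count(num: int):
    available = torch.cuda.device_count() if torch.cuda.is_available() else 0
    marks = [
        pytest.mark.skipif(
            available < num,
            reason=f"requires at least {num} GPUs, found {available}",
        )
    ]
    if num > 1:
        marks.append(pytest.mark.multi_gpu)
    return marks  # e.g. pytest.param(..., marks=requires_gpu_count(2))
```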
Okay, I ended up just adding the `mark.multi_gpu` manually.

And restored the second `mixtral.py` test I had previously removed, because that is the only test that is currently running on `mark.multi_gpu`.
Please don’t merge without @dbarbuzzi’s approval
"trl_mixin/ex_trl_distillation.py", | ||
marks=( | ||
pytest.mark.skip("disabled until further updates"), | ||
pytest.mark.multi_gpu, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. All that matters is that we have a system for adding multi-gpu tests moving foward.
I think something like this would be good moving forward.
pytest.mark.multi_gpu, | |
requires_gpu_count(2), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is generally fine as long as the overall coverage is kept and the marks are preserved (e.g. `pytest.mark.multi_gpu`), as the marks are critical in how the test cases are executed to balance load across limited resources.
If interested, one refactor I had on my to-do list was to update how the script execution output is handled. Currently, it is inserted as part of the failure message, which was originally done so that it would only be written to output if the test failed. However, even if we keep that condition, I wanted to move it out of the error message and instead just log it separately (potentially only if the test failed, if we wanted to keep similar behavior). I only point this out in case we want to try to incorporate that into this refactor or make a new task for future work.
One quick note, though, that I’ll leave to stakeholders to consider:
Removed README parsing […]
Historically, one of the intents of the examples testing has been to test code blocks from the README as they were considered part of the example. We wanted to ensure that end users who were utilizing the rendered README (not just raw Markdown) and potentially copying code from it were not going to hit errors in the copied code.
If we are not as concerned with this aspect any longer, it’s fine to remove these tests. However, if we do want to cover that angle, they shouldn’t be removed, though perhaps they can be modified with a different approach to validate the command without actually executing it (though it should still be extracted from rendered Markdown if this is the case, not the raw source).
I agree with the idea of testing the README codeblocks; however, all of the existing README parsing tests were just running the "Quickstart" codeblocks, which just call the example script (e.g. `python3 some_script.py`). You can also see this because a lot of the README tests have `assert command.startswith("python")`.

I think it could be valuable to test the other walkthrough codeblocks in some way, and we could use README parsing for that. But the current system is basically just testing that the script name in the readme matches, while adding a lot of complexity. If we think that's worthwhile confirming, then I can add a test which just checks that?
Yeah, I noticed the script outputs are currently suppressed while running. I think it would be useful to record the logs for all of the tests. Writing these to log files and uploading them as artifacts might be one way to do this without cluttering the console output. I think this could be a separate PR, but I can help implement it if you'd like.
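A hedged sketch of the kind of change being discussed (helper name, log directory, and behavior are assumptions, not the current suite's implementation): capture each script's output, write it to a per-test log file that CI could upload as an artifact, and only surface it in the console on failure.

```python
# Illustrative sketch: log example-script output separately instead of
# embedding it in the assertion failure message.
import logging
import subprocess
from pathlib import Path

logger = logging.getLogger(__name__)
LOG_DIR = Path("example_test_logs")  # assumed artifact directory


def run_example_script(cmd: list[str], log_name: str) -> None:
    result = subprocess.run(cmd, capture_output=True, text=True)

    LOG_DIR.mkdir(exist_ok=True)
    log_file = LOG_DIR / f"{log_name}.log"
    log_file.write_text(f"$ {' '.join(cmd)}\n{result.stdout}\n{result.stderr}")

    if result.returncode != 0:
        # keep behavior similar to today: only surface output when the test fails
        logger.error("script %s failed; full output written to %s", cmd, log_file)

    assert result.returncode == 0, f"{cmd} exited with code {result.returncode}"
```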
As trivial as it sounds, that is actually part of the original intent (because there were occasionally things like typos in file names, or the file gets renamed without the command being updated, etc.). Therefore, a lighter-weight validation is still warranted if we are concerned with README contents (maybe a check that the command is basically comprised of two tokens – `python` and the path of an example script that actually exists).
Okay, that's fair. I think I would still rather test that explicitly in a separate test and keep the script-running tests clean. I will add the test to this PR.
@dbarbuzzi Please take a look when you get a chance. I added explicit readme tests to confirm that the script exists and is called using `python`. I was initially concerned that the readme tests would interfere with sharding (if all 6 lightweight readme tests were on one GPU and displaced other jobs), but
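The added readme tests themselves aren't shown in this thread; as a hedged sketch of the lightweight validation being described (paths, the regex, and parsing the raw Markdown rather than the rendered README are all assumptions):

```python
# Illustrative sketch: check that each example README's quickstart command
# invokes python on a script that actually exists, without running it.
import re
from pathlib import Path

import pytest

EXAMPLES_DIR = Path("examples")  # assumed repo layout
README_PATHS = sorted(EXAMPLES_DIR.glob("*/README.md"))

FENCE = "`" * 3  # markdown code-fence delimiter
CODE_BLOCK_RE = re.compile(FENCE + r"(?:shell|bash)?\n(.*?)" + FENCE, re.DOTALL)


@pytest.mark.parametrize("readme", README_PATHS, ids=lambda p: p.parent.name)
def test_readme_command(readme: Path):
    blocks = CODE_BLOCK_RE.findall(readme.read_text())
    commands = [
        line.strip()
        for block in blocks
        for line in block.splitlines()
        if line.strip().startswith("python")
    ]
    assert commands, f"no python command found in {readme}"
    for command in commands:
        tokens = command.split()
        # expect roughly `python <script>.py`, with the script present on disk
        assert len(tokens) >= 2, f"unexpected command in {readme}: {command}"
        assert (readme.parent / tokens[1]).exists(), (
            f"{tokens[1]} referenced in {readme} does not exist"
        )
```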
Force-pushed from 3a44a06 to 8fde9b8
SUMMARY:
Combines the logic from `tests/examples/test_*.py` into `tests/examples/test_example_scripts.py`, which has a parametrized test fn that runs them instead.

Notes:
- Removed the README parsing; the commands extracted from the READMEs were effectively just `python3 some_script.py`, so I replaced these by just calling the script instead.
- Added a `TestCase` namedtuple / made the test function more flexible (i.e. pre-processing, flags, and post-processing verification).

TEST PLAN:
All of the example tests essentially boil down to "run the example script and check if it crashes". A few also have additional checks or preprocessing steps, but that is the main idea. To run the scripts, the tests all (both before and after changes) call `run_cli_command` in `tests/testing_utils.py`, which does the actual `python ...` call.

Therefore, to test this change I replaced that function with a dummy function that just prints the command and returns a success. Then I verified that the printed commands matched before and after changes (excluding reorderings of the calls).
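For example, a hedged sketch of that kind of stub (the fixture, import path, and dummy signature below are assumptions; the actual verification described here was done by hand):

```python
# Illustrative sketch: replace run_cli_command with a dummy that records and
# prints the command instead of executing it.
import types

import pytest

import tests.testing_utils as testing_utils  # assumed import path

RECORDED_COMMANDS = []


@pytest.fixture(autouse=True)
def dummy_run_cli_command(monkeypatch):
    def fake_run(command, *args, **kwargs):
        RECORDED_COMMANDS.append(command)
        print(command)
        # mimic a successful result so the calling test proceeds as if the script passed
        return types.SimpleNamespace(returncode=0, stdout="", stderr="")

    # if test modules import run_cli_command directly, patch those modules instead
    monkeypatch.setattr(testing_utils, "run_cli_command", fake_run)
```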
In addition, to verify that the special handling works correctly I ran the full examples test suite and confirmed that the tests with special handling still passed.