[Quantization] Support more than one quant-compressor #415
base: main
Conversation
```diff
@@ -164,7 +164,7 @@ def from_pretrained_model(
     cls,
     model: Module,
     sparsity_config: Union[SparsityCompressionConfig, str, None] = None,
     quantization_format: Optional[str] = None,
```
Afaict this is the only entrypoint for this function. Why not just adjust the upstream function `infer_quantization_format` to infer the mixed value, rather than supporting an extra data type (`List[str]`) which ideally should never actually appear?
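For illustration, a rough sketch of the single-format alternative being suggested here; the helper logic and the signature of `infer_quantization_format` are assumptions for the sake of the example, not the actual llm-compressor API:

```python
from typing import Optional

def infer_quantization_format(model) -> Optional[str]:
    # Collect the per-module formats that would otherwise be returned as a list
    formats = {
        module.quantization_scheme.format
        for module in model.modules()
        if getattr(module, "quantization_scheme", None) is not None
    }
    formats.discard(None)
    if not formats:
        return None  # nothing to compress
    if len(formats) == 1:
        return next(iter(formats))
    # Collapse multiple per-module formats into the single global "mixed" value
    return "mixed-precision"
```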
I agree with @kylesayrs on this. Also, if a list of quantization formats is passed in, do we override it to the mixed-precision format and then infer the formats again downstream?
I disagree; separation of concerns. `infer_quantization_format` is responsible for inferring the formats in the model, but what gets written to the config should be determined by the `ModelCompressor` class, which is ultimately responsible for writing the quantization config.
We don't infer again - we use the per-module format attached to each scheme to compress each module.
See the updated llmcompressor functionality: vllm-project/llm-compressor#1713
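To make the per-module flow concrete, here is a minimal sketch of the dispatch described above; the attribute names follow the discussion, and the `compress` call is simplified relative to the real compressor API:

```python
def compress_module_weights(module, compressors):
    # `compressors` is assumed to be a dict keyed by format string, as in this PR
    scheme = getattr(module, "quantization_scheme", None)
    if scheme is None or scheme.format is None:
        return module.state_dict()  # unquantized module, nothing to compress
    compressor = compressors[scheme.format]
    return compressor.compress(module.state_dict())  # simplified signature
```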
Afaict the only reason we would need to infer the list of quantization formats used in a model is to write to the config. Since the model_compressor is responsible for writing to the config, I would argue that the "infer global quantization tag for the purposes of writing to config" logic should live in the model compressor.
If we are going to pass all available formats, why are we then re-inferring afterwards via `_fetch_unique_quantization_formats`? This seems like a potential conflict in source of truth. Ideally `scheme.format` should be the source of truth for formats.
🎉 LGTM!
Nice feature. I agree with @kylesayrs's recommendation, plus updating the docstrings and adding a test specifically for mixed-precision compression/decompression.
```diff
         self.quantization_compressor: Optional[
-            Union[BaseQuantizationCompressor, DenseCompressor]
+            Dict[str, Union[BaseQuantizationCompressor, DenseCompressor]]
         ] = None
```
Should we rename to `self.quantization_compressors` to indicate this is now a dict? Or is there some reason we can't, e.g. because it's serialized?
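A small sketch of the dict-of-compressors shape under discussion; `load_compressor` is a hypothetical stand-in for whatever registry lookup the library actually uses:

```python
from typing import Dict, List, Union

def build_quantization_compressors(
    compression_formats: List[str],
) -> Dict[str, Union["BaseQuantizationCompressor", "DenseCompressor"]]:
    # One compressor instance per unique format found in the model
    return {fmt: load_compressor(fmt) for fmt in compression_formats}
```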
```python
# Note - compress only supports one compression format atm
quant_compressor = next(iter(self.quantization_compressor))
state_dict = quant_compressor.compress(
```
How will we get around this constraint of `compress` only supporting one format?
We will have to expand its functionality. This pathway is no longer used by llmcompressor, so there is no immediate requirement.
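One possible shape for lifting the single-format constraint later; this is entirely hypothetical and assumes a module-name-to-format mapping is available to the caller:

```python
def compress_state_dict(state_dict, compressors, name_to_format):
    # Partition tensors by the format of the module they belong to, then let
    # each compressor handle only its own subset.
    compressed = {}
    for fmt, compressor in compressors.items():
        subset = {
            key: value
            for key, value in state_dict.items()
            if name_to_format.get(key.rsplit(".", 1)[0]) == fmt
        }
        compressed.update(compressor.compress(subset))  # simplified signature
    return compressed
```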
Force-pushed from cd324dd to 8b5d4c9.
Seems like there are 3 sources of truth for quantization format. It'd be nice if we had something like:

```python
def get_model_compression_format(model: torch.nn.Module) -> Set[CompressionFormat]:
    return set(
        getattr_chain(module, "quantization_scheme.format", CompressionFormat.dense)
        for module in model.modules()
    )
```

We still support overwriting the global compression format, but this is not a common pathway, which is why it was not part of this PR's change for the per-module case. Ideally, we can also update our preset schemes to include the compression formats as well. But again, that is not what this PR is targeting, as it is not our typical user pathway. I agree we can remove
Has this been tested with model reloading? I see a couple of potential issues there.
In the case where we want to load a model which has mixed compression:
- `from_pretrained_model` and `from_compression_config` both set `quantization_config.format` to be `"mixed"`. If `quantization_config.format` is set, `_fetch_unique_quantization_formats` will not be called.
- Since the model_compressor assumes that module formats have previously been set by `infer_per_module_quantization_format` (and this function only), will this work for pathways in which we compress models without calling `infer_per_module_quantization_format` first?
There seems to be implicit coupling between `infer_per_module_quantization_format`, `ModelCompressor.from_pretrained_model`, and `ModelCompressor.compress`/`decompress`, where `infer_per_module_quantization_format` must be called before the others. If we're going to do this, we should raise errors if a module has `scheme.format = None`.
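A minimal sketch of the error-raising guard suggested here, with assumed attribute names:

```python
def get_module_compressor(module, compressors):
    scheme = getattr(module, "quantization_scheme", None)
    if scheme is None:
        return None  # module is not quantized
    if scheme.format is None:
        raise ValueError(
            "quantization_scheme.format is unset; call "
            "infer_per_module_quantization_format (or set the format "
            "explicitly) before compressing"
        )
    return compressors[scheme.format]
```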
```python
compression_formats = None
if quantization_format is not None:
    # llmcompressor incorrectly passes in a CompressionFormat when
    # the value string is expected - handle both cases
    if isinstance(quantization_format, (str, CompressionFormat)):
        quantization_format = [quantization_format]

    compression_formats = quantization_format
```
FYI this parsing logic is duplicated in `from_pretrained_model` and `decompress_model`.
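One way to deduplicate that parsing into a single helper; this is a sketch, not part of the PR, and the `CompressionFormat` import path is assumed:

```python
from typing import List, Optional, Union

from compressed_tensors.config import CompressionFormat  # import path assumed

def normalize_compression_formats(
    quantization_format: Union[str, CompressionFormat, List[str], None],
) -> Optional[List[str]]:
    if quantization_format is None:
        return None
    if isinstance(quantization_format, (str, CompressionFormat)):
        quantization_format = [quantization_format]
    # Normalize any CompressionFormat members to their string values
    return [
        fmt.value if isinstance(fmt, CompressionFormat) else fmt
        for fmt in quantization_format
    ]
```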
```python
# If empty list, fallback to using the global format
if len(quantization_formats) == 0:
    quantization_formats.append(self.quantization_config.format)
```
`self.quantization_config.format` is nullable afaict; please add logic and/or a type hint to account for this.
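For example, a sketch of guarding the nullable global format; falling back to dense is an assumption here, not necessarily the intended behavior:

```python
# If no per-module formats were found, fall back to the global config format,
# guarding against it being None.
if len(quantization_formats) == 0:
    global_format = self.quantization_config.format
    if global_format is None:
        global_format = CompressionFormat.dense.value
    quantization_formats.append(global_format)
```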
Approving with the following list of follow-ups.
Follow-ups directly in scope of this PR:
- Consider inferring the compression format on a per-module basis. This enables users to manually specify formats (useful for debugging at the least), and more importantly decouples compression from requiring that `infer_quantization_format` be called prior. For example:
```python
def get_module_format(module):
    qscheme = module.quantization_scheme
    sscheme = module.sparsity_scheme  # or from a map
    inferred_format = infer_compression_format(qscheme, sscheme)
    if qscheme is not None and qscheme.format != inferred_format:
        # warn
        ...
```
We can still support a global override by passing it to this function.
- Consider only inferring the `format` label at config-serialization time, rather than prior. This avoids having to pass and parse the format in multiple places, and prevents user or model-loading code from accidentally passing `"mixed"` as a format. For example:
```python
def update_config(self, model):
    config[QUANTIZATION_CONFIG_NAME].format = get_model_format(model)

def get_model_format(model):
    return set(get_module_format(module) for module in model.modules())
```
Follow-ups that are related but might make implementation easier:
- Consider refactoring compressors into functions, not objects. For example:
```python
def compress_model(model):
    for name, module in model.named_modules():
        format = get_compression_format(module)
        module = compress_module(module, format)
        set_module(model, name, module)

def compress_module(module, format):
    if format == CompressionFormat.dense:
        return module
    if format == CompressionFormat.Sparse24:
        return Sparse24Compressor.compress_module(module)
    ...
```
- Consider refactoring `format` to not be nullable. This reduces the required parsing logic and tightens type hinting.
Summary
- Updates `ModelCompressor.quantization_compressor` to now be a dictionary, such that more than one quantization compressor can be supported
- Adds `mixed-precision` as a new `CompressionFormat` - if more than one format is found within the model, `mixed-precision` is set as the model's global format in its `config.json`
- Adds `format` to the `QuantizationScheme` and leverages this per-module format field in order to fetch the appropriate compressor to compress the model
- `ModelCompressor.compress` and `ModelCompressor.decompress`: only `compress_model` and `decompress_model` currently support this functionality, as compress/decompress essentially only support global formats
Testing:
Next Steps:
Example Updates
New config: