Description
This issue addresses the large storage footprint of attribution score tensors by adding a `scores_precision` parameter to the `FeatureAttributionOutput.save` method.
Proposed by: @g8a9
Motivation
Currently, tensors in `FeatureAttributionOutput` objects (attributions and step scores) are serialized at float32 precision by default when using `out.save()`. While the representation of these values can be compressed with `ndarray_compact=True`, the resulting JSON files are usually quite large. Using more parsimonious data types could reduce the size of saved objects and facilitate systematic analyses leveraging large amounts of data.
Proposal
`float32` precision should probably remain the default behavior, as we do not want to cause any information loss by default.
`float16` and `float8` should also be considered, in both signed and unsigned variants: for score types that are strictly non-negative, an unsigned representation supports greater precision at the same bit width, while the narrower types halve space requirements. The unsigned variant will be used by default whenever a tensor contains no negative scores.
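As a usage sketch, the proposed interface could look as follows (the `scores_precision` argument and its accepted values are the subject of this proposal, not an existing API; the model and method names are arbitrary examples):

```python
import inseq

model = inseq.load_model("Helsinki-NLP/opus-mt-en-fr", "saliency")
out = model.attribute("Hello world, here is a sentence!")

# Current behavior: scores are serialized at float32 precision
out.save("attributions.json", overwrite=True)

# Proposed: opt into smaller, lossy representations
out.save("attributions_fp16.json", scores_precision="float16", overwrite=True)
out.save("attributions_fp8.json", scores_precision="float8", overwrite=True)

# Reloading should transparently restore float tensors
reloaded = inseq.FeatureAttributionOutput.load("attributions_fp8.json")
```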
`float16` can be supported simply by casting tensors to the native `torch.float16` data type, which preserves precision up to 4 decimal places for scores normalized in the [-1, 1] interval (8 for unsigned tensors). This corresponds to 2 or 4 decimal places for `float8`. However, float8 is not supported natively in PyTorch, so tensors should be converted to `torch.int8` and `torch.uint8` instead and converted back to floats upon reloading the object.
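Since PyTorch lacks a native float8 type, a minimal sketch of the integer round trip might look like this (the helper names and the fixed-point scaling scheme are illustrative assumptions, not a settled design):

```python
import torch

def quantize_scores(t: torch.Tensor) -> tuple[torch.Tensor, float]:
    """Map scores normalized in [-1, 1] to an 8-bit integer representation.

    Tensors without negative values use torch.uint8 (256 levels);
    otherwise torch.int8 is used (signed, 255 levels).
    """
    if (t >= 0).all():
        scale = 255.0
        return (t * scale).round().to(torch.uint8), scale
    scale = 127.0
    return (t * scale).round().to(torch.int8), scale

def dequantize_scores(q: torch.Tensor, scale: float) -> torch.Tensor:
    """Restore float32 scores from their 8-bit representation on reload."""
    return q.to(torch.float32) / scale

scores = torch.tensor([0.12, 0.5, 0.98])  # e.g. normalized saliency scores
q, scale = quantize_scores(scores)        # stored as torch.uint8
restored = dequantize_scores(q, scale)    # ~2 decimal places recovered
```

Whatever scheme is chosen, the scale factor (or an equivalent marker of the chosen variant) would need to be stored alongside the tensor in the serialized output so that reloading is self-contained.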