
Document, how does NNCF compute gradients for QAT #3870

@ruro

Description

I was trying to figure out what NNCF actually does to make the various fake quantization functions differentiable.

The original "Neural Network Compression Framework for fast model inference" paper claimed that

... methods proposed in [12], where quantization parameters are learned using gradient descent. In our framework we use a similar quantization method, along with other quantization schemes, while also providing the ability to automatically insert Fake Quantization operations in the model graph.

where reference 12 is "PACT: Parameterized Clipping Activation for Quantized Neural Networks".

However, it seemingly doesn't disclose which "similar quantization method" is actually used. It's not really PACT, since PACT doesn't support "proper" asymmetric quantization, requires replacing ReLU with a custom activation function, and adds an extra regularization term to the loss.

Unlike PACT, NNCF just defines custom forward/backward functions that are differentiable with respect to inputs, inputs_low and inputs_range (without relying on custom activation functions or extra loss terms); a rough sketch of this kind of quantizer is given after the list below. The implementations of the custom forward/backward functions in question can be found in

  • src/nncf/torch/quantization/reference.py (pure python implementation)
  • src/nncf/torch/extensions/src/quantization/cpu/functions_cpu.cpp (CPU)
  • src/nncf/torch/extensions/src/quantization/cuda/functions_cuda_impl.cu (CUDA)
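
To make the question concrete, here is a minimal sketch of an asymmetric fake quantizer that is trainable in the input, a lower bound and a range, using a straight-through estimator for the rounding and PACT-style clipping gradients for the saturated elements. This is my own illustration under those assumptions, not NNCF's actual kernels; the class name FakeQuantizeSTE, the levels argument and the scalar low/range parameters are all hypothetical.

```python
import torch


class FakeQuantizeSTE(torch.autograd.Function):
    """Asymmetric fake quantization, differentiable in x, input_low and input_range (sketch)."""

    @staticmethod
    def forward(ctx, x, input_low, input_range, levels=256):
        scale = (levels - 1) / input_range
        # clamp to the learned range, round onto a uniform grid, then dequantize
        x_clamped = torch.clamp(x, input_low, input_low + input_range)
        q = torch.round((x_clamped - input_low) * scale) / scale + input_low
        ctx.save_for_backward(x, input_low, input_range)
        return q

    @staticmethod
    def backward(ctx, grad_out):
        x, input_low, input_range = ctx.saved_tensors
        below = x < input_low
        above = x > input_low + input_range
        inside = ~(below | above)
        # straight-through estimator: rounding is treated as identity,
        # so in-range inputs pass the gradient through unchanged
        grad_x = grad_out * inside.to(grad_out.dtype)
        # saturated inputs are flattened by the clamp, so their gradient
        # flows into the range parameters instead (scalar low/range assumed)
        grad_low = (grad_out * (below | above).to(grad_out.dtype)).sum()
        grad_range = (grad_out * above.to(grad_out.dtype)).sum()
        return grad_x, grad_low, grad_range, None
```

The point of the sketch is the split in the backward pass: rounding is bypassed with a straight-through estimator, while the clamp boundaries collect the gradient from the saturated elements, which is what lets the low/range parameters be trained with plain SGD. Whether NNCF's CPU/CUDA kernels use exactly this split is the thing I'd like documented.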

I think the paper closest to what NNCF does is actually "Learned Step Size Quantization" (aka LSQ), not PACT, although NNCF's implementation doesn't quite match LSQ either.
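
For comparison, this is the step-size gradient as published in the LSQ paper (I'm quoting LSQ's formulation here, not claiming this is what NNCF computes). LSQ uses a single learnable step size $s$ and clipping bounds $-Q_N, Q_P$:

$$
\bar v = \Big\lfloor \operatorname{clip}\big(v/s,\,-Q_N,\,Q_P\big) \Big\rceil, \qquad \hat v = \bar v \cdot s
$$

$$
\frac{\partial \hat v}{\partial s} =
\begin{cases}
-\,v/s + \lfloor v/s \rceil, & -Q_N < v/s < Q_P, \\
-\,Q_N, & v/s \le -Q_N, \\
\phantom{-}Q_P, & v/s \ge Q_P.
\end{cases}
$$

LSQ also rescales the step-size gradient by a factor $g = 1/\sqrt{N_W Q_P}$ and learns only a symmetric step size with no offset, which already differs from NNCF's asymmetric inputs_low/inputs_range parameterization, so the exact gradient NNCF uses would still need to be documented.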
