Description
I was trying to figure out what NNCF actually does to make its various fake quantization functions differentiable.
The original "Neural Network Compression Framework for fast model inference" paper claimed that
... methods proposed in [12], where quantization parameters are learned using gradient descent. In our framework we use a similar quantization method, along with other quantization schemes, while also providing the ability to automatically insert Fake Quantization operations in the model graph.
where reference 12 is "PACT: Parameterized Clipping Activation for Quantized Neural Networks".
However, it doesn't seem to disclose which "similar quantization method" it actually uses. It isn't really PACT, since PACT doesn't support "proper" asymmetric quantization, requires replacing ReLU with a custom activation function, and adds an extra regularization term to the loss.
Unlike PACT, NNCF simply defines custom forward/backward functions that are differentiable with respect to inputs, inputs_low and inputs_range (without relying on custom activation functions or extra loss terms). The implementations of these custom forward/backward functions can be found in:
- `src/nncf/torch/quantization/reference.py` (pure Python implementation)
- `src/nncf/torch/extensions/src/quantization/cpu/functions_cpu.cpp` (CPU)
- `src/nncf/torch/extensions/src/quantization/cuda/functions_cuda_impl.cu` (CUDA)
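
To make the question concrete, here is a minimal sketch of what such a custom autograd op looks like: an asymmetric fake-quantize that is differentiable w.r.t. the input, the lower bound and the range via a straight-through-style backward pass. This is my own illustration, not NNCF's actual code; the names (`FakeQuantizeSketch`, `input_low`, `input_range`) and the exact gradient choices for the range parameters are assumptions.

```python
import torch


class FakeQuantizeSketch(torch.autograd.Function):
    """Illustrative asymmetric fake-quantize, not NNCF's implementation."""

    @staticmethod
    def forward(ctx, x, input_low, input_range, levels=256):
        scale = (levels - 1) / input_range
        # Clamp to the quantization range, then snap to the integer grid.
        x_clamped = torch.clamp(x, input_low, input_low + input_range)
        q = torch.round((x_clamped - input_low) * scale) / scale + input_low
        ctx.save_for_backward(x, input_low, input_range, q)
        return q

    @staticmethod
    def backward(ctx, grad_output):
        x, input_low, input_range, q = ctx.saved_tensors
        below = (x < input_low).to(grad_output.dtype)
        above = (x > input_low + input_range).to(grad_output.dtype)
        inside = 1.0 - below - above

        # Straight-through estimator for the input: pass gradients only
        # where the input fell inside the quantization range.
        grad_x = grad_output * inside

        # Illustrative (assumed) gradients for the range parameters:
        # out-of-range elements push on the corresponding bound, in-range
        # elements contribute the rounding error, LSQ-style.
        err = (q - x) / input_range
        grad_low = grad_output * (below + above)
        grad_range = grad_output * (above + inside * err)

        # Reduce over all dims, assuming scalar (per-tensor) parameters.
        return grad_x, grad_low.sum(), grad_range.sum(), None
```

Calling `FakeQuantizeSketch.apply(x, input_low, input_range)` with `requires_grad=True` tensors then yields gradients for all three arguments; the asymmetric (low, range) parameterization is what sets this apart from PACT's single clipping scalar.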
I think the paper closest to what NNCF does is actually "Learned Step Size Quantization" (aka LSQ), not PACT, although NNCF's implementation doesn't quite match LSQ either.
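
For comparison, here is LSQ's gradient of the quantizer output with respect to its learned step size, as derived in the LSQ paper (Esser et al.), written out as a small Python sketch of my own (not code from either project). One visible mismatch is that LSQ learns a single step size, whereas NNCF parameterizes the range through a lower bound and a range.

```python
def lsq_step_size_grad(v: float, s: float, q_n: int, q_p: int) -> float:
    """d v_hat / d s for v_hat = round(clip(v / s, -q_n, q_p)) * s."""
    ratio = v / s
    if ratio <= -q_n:
        return float(-q_n)           # clipped below: gradient is -Q_N
    if ratio >= q_p:
        return float(q_p)            # clipped above: gradient is Q_P
    return -ratio + round(ratio)     # in range: STE through round()
```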