From b152ca53fe6f3a2c4dbbd05b75b7079207895145 Mon Sep 17 00:00:00 2001
From: Dave Welsch
Date: Tue, 28 Jan 2025 09:50:15 -0800
Subject: [PATCH] Edit mixed precision pages in feature guide.

Signed-off-by: Dave Welsch
---
 Docs/featureguide/mixed precision/amp.rst   | 135 +++++++++----------
 Docs/featureguide/mixed precision/index.rst |  33 ++---
 Docs/featureguide/mixed precision/mmp.rst   | 142 ++++++++++++++++----
 3 files changed, 192 insertions(+), 118 deletions(-)

diff --git a/Docs/featureguide/mixed precision/amp.rst b/Docs/featureguide/mixed precision/amp.rst
index 7fbdfbe3703..f8a8f13970c 100644
--- a/Docs/featureguide/mixed precision/amp.rst
+++ b/Docs/featureguide/mixed precision/amp.rst
@@ -6,139 +6,130 @@ Automatic mixed precision
 #########################
 
-This technique helps choose per-layer integer bit-widths to retain model accuracy when run on
+Automatic mixed precision (AMP) helps choose per-layer integer bit widths to retain model accuracy on
 fixed-point runtimes like |qnn|_.
 
-As an example, say a particular model is not meeting a desired accuracy target when run in INT8.
-The Auto Mixed Precision (AMP) feature will find a minimal set of layers that need to run on higher
-precision, INT16 for example, to get to the desired quantized accuracy.
+For example, consider a model that is not meeting an accuracy target when run in INT8.
+AMP finds a minimal set of layers that need to run at a higher precision, INT16 for example, to achieve the target accuracy.
 
-Choosing a higher precision for some layers necessarily involves a trade-off: lower inferences/sec
-for higher accuracy and vice-versa. The AMP feature will generate a pareto curve that can guide
-the user to decide the right operating point for this tradeoff.
+Choosing a higher precision for some layers involves a trade-off between performance (inferences per second)
+and accuracy. The AMP feature generates a Pareto curve you can use to decide the right operating point for this trade-off.
 
 Context
 =======
 
-For performing AMP, a user needs to start with a PyTorch, TensorFlow or ONNX model and create a
-Quantization Simulation model :class:`QuantizationSimModel`. This QuantSim model, along with an
+To perform AMP, you need a PyTorch, TensorFlow, or ONNX model. You use the model to create a
+Quantization Simulation (QuantSim) model :class:`QuantizationSimModel`. This QuantSim model, along with an
 allowable accuracy drop, is passed to the API.
 
-The function changes the QuantSim Sim model in place with different quantizers having different
-bit-widths. This QuantSim model can be either exported or evaluated to get a quantization accuracy.
+The API function changes the QuantSim model in place, assigning different bit widths to different quantizers. You can export or evaluate this QuantSim model to calculate a quantization accuracy.
 
 .. image:: ../../images/automatic_mixed_precision_1.png
    :width: 900px
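+
+For example, an AMP invocation has roughly the following shape. This is a minimal sketch, not the
+verbatim API: the module path, entry-point name, and parameter names are illustrative assumptions,
+and the callbacks and candidate list are placeholders. See the code examples below for the exact
+signatures.
+
+.. code-block:: python
+
+    # Hedged sketch of an AMP call (PyTorch variant). Assumes `model`, `dummy_input`,
+    # `candidates`, and the callbacks are already defined, as in the examples below.
+    from aimet_torch.quantsim import QuantizationSimModel
+    from aimet_torch.mixed_precision import choose_mixed_precision  # assumed entry point
+
+    # Simulate quantization at the base precision (for example, INT8).
+    sim = QuantizationSimModel(model, dummy_input=dummy_input)
+
+    # AMP changes `sim` in place, raising bit widths only where needed to stay
+    # within the allowed accuracy drop.
+    choose_mixed_precision(
+        sim,
+        dummy_input,
+        candidates=candidates,                          # allowed (activation, param) precisions
+        eval_callback_for_phase1=eval_callback_phase1,  # cheap metric for sensitivity analysis
+        eval_callback_for_phase2=eval_callback_phase2,  # full evaluation for the Pareto front
+        allowed_accuracy_drop=0.01,                     # tolerate a 1% drop
+        results_dir="./amp_results",
+        clean_start=True,
+        forward_pass_callback=forward_pass_callback,
+    )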
 
 Mixed Precision Algorithm
-=========================
+-------------------------
 
-The algorithm involves 4 phases:
+The algorithm involves four phases, as shown in the following image.
 
 .. image:: ../../images/automatic_mixed_precision_2.png
    :width: 700px
 
-1) Find layer groups
---------------------
+Phase 1: Find layer groups
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-    Layer Groups are defined as a group of layers grouped together based on certain rules.
-    This helps in reducing search space over which the mixed precision algorithm operates.
-    It also ensures that we search only over the valid bit-width settings for parameters and activations.
+Layer groups are defined based on certain rules.
+Grouping layers helps reduce the search space over which the mixed precision algorithm operates.
+It also ensures that the search occurs only over valid bit-width settings for parameters and activations.
 
 .. image:: ../../images/automatic_mixed_precision_3.png
    :width: 900px
 
-2) Perform sensitivity analysis (Phase 1)
------------------------------------------
+Phase 2: Perform sensitivity analysis
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-    In this phase the algorithm performs a per-layer group sensitivity analysis.
-    This will identify how sensitive is the model if we choose a lower quantization bit-width for a particular layer group.
-    The sensitivity analysis yields an accuracy list which is cached and can be re-used again by the algorithm.
+The algorithm performs a per-layer group sensitivity analysis.
+This identifies how sensitive the model is to a lower quantization bit width for a particular layer group.
+The sensitivity analysis creates and caches an accuracy list that is reused in the following phases of the algorithm.
 
-    Below is an example of a list generated using sensitivity analysis:
+Following is an example of an accuracy list generated using sensitivity analysis:
 
 .. image:: ../../images/accuracy_list.png
    :width: 900px
 
-3) Create a Pareto-front list (Phase 2)
----------------------------------------
+Phase 3: Create a Pareto-front list
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-    A Pareto curve is a trade-off curve that describes how accuracy varies given a bit-ops target and vice versa.
-    The AMP algorithm yields a Pareto front curve which consists of layer groups changed up to that point, relative bit-ops (relative to starting bit-ops),
-    accuracy of the model, and the bit-width to which the layer group was changed to.
+A Pareto curve, or Pareto front, describes the trade-off between accuracy and bit-ops.
+The AMP algorithm generates a Pareto curve showing, for each layer group changed:
 
-    An example of a Pareto list:
+- Bitwidth: The bit width to which the layer group was changed
+- Accuracy: The accuracy of the model
+- Relative bit-ops: The bit-ops relative to the starting bit-ops
 
-    .. image:: ../../images/pareto.png
-        :width: 900px
+An example of a Pareto list:
 
-    Bit-ops are computed as
+.. image:: ../../images/pareto.png
+   :width: 900px
 
-    :math:`Bit-ops = Mac(op) * Bitwidth(parameter) * Bitwidth(Activation)`
+Bit-ops are computed as:
+
+:math:`Bitops = MAC(op) \times Bitwidth(parameter) \times Bitwidth(activation)`
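+
+As an illustration of this formula, the following sketch (with hypothetical layer shapes) computes
+the bit-ops of a single convolution and the cost of promoting it from INT8 to INT16:
+
+.. code-block:: python
+
+    # Bit-ops for one op: MACs * parameter bit width * activation bit width.
+    def bitops(macs: int, param_bw: int, act_bw: int) -> int:
+        return macs * param_bw * act_bw
+
+    # Example: a 3x3 convolution, 64 -> 128 channels, over a 56x56 output feature map.
+    macs = 3 * 3 * 64 * 128 * 56 * 56
+
+    int8_bitops = bitops(macs, 8, 8)     # all-INT8 baseline
+    int16_bitops = bitops(macs, 16, 16)  # the same op promoted to INT16
+
+    # Relative bit-ops, as reported in the Pareto list, are measured against the
+    # starting configuration: promoting this op costs 4x its baseline bit-ops.
+    relative = int16_bitops / int8_bitops  # == 4.0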
 
-    The Pareto list can be used for plotting a Pareto curve. A Bokeh plot for Pareto curve is generated and saved in the results directory.
-
-    .. image:: ../../images/pareto_curve.png
-        :width: 900px
+The Pareto list can be used to plot a Pareto curve. A plot of the Pareto curve is generated using Bokeh and saved in the results directory.
+
+.. image:: ../../images/pareto_curve.png
+   :width: 900px
 
-.. note::
-
-    A user can pass two different evaluation callbacks for phase 1 and phase 2. Since phase 1 is measuring sensitivity
-    of each quantizer group, we can pass a smaller representative dataset for phase 1 for evaluation, or even use an indirect measure
-    such as SQNR which can be computed faster than but correlates well with the real evaluation metric.
-
-It is recommended to use the complete dataset for evaluation in phase 2.
+You can pass two different evaluation callbacks: one for the sensitivity analysis (phase 2) and one for creating the Pareto front (phase 3).
+Since phase 2 measures the sensitivity of each quantizer group, it can use a smaller representative dataset for evaluation, or even an indirect measure, such as SQNR, that correlates well with the direct evaluation metric but can be computed faster.
+
+We recommend that you use the complete dataset for evaluation in phase 3.
 
-4) Reduce Bit-width Convert Op Overhead (Phase 3)
--------------------------------------------------
+Phase 4: Reduce bit-width convert op overhead
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Convert Ops are introduced in the mixed-precision model for transition between Ops that are assigned different activation
-bit-widths or data types (float vs int). These Convert Ops contribute to the inference time along with bit-operations of Ops.
-In this phase the algorithm derives a mixed-precision solution having less Convert Op overhead w.r.t. to original solution
-keeping the mixed-precision accuracy intact. The algorithm produces mixed-precision solutions for a range of alpha values
-(0.0, 0.2, 0.4, 0.6, 0.8, 1.0) where the alpha represents fraction of original Convert Op overhead allowed for respective solution.
+Conversion operations (convert ops) are introduced in the mixed-precision model to transition between ops with different activation bit widths or data types (float vs. int). Convert ops contribute to the inference time along with the bit-operations of ops.
+
+In this phase the algorithm derives a mixed-precision solution with less convert op overhead than the original solution while keeping the mixed-precision accuracy intact. The algorithm produces mixed-precision solutions for a range of alpha values (0.0, 0.2, 0.4, 0.6, 0.8, 1.0), where alpha represents the fraction of the original convert op overhead allowed in the corresponding solution.
 
-Use Cases
-=========
+Use cases
+---------
+
+AMP supports the following two use cases, illustrated together in the sketch after this list.
 
-1) Choosing a very high accuracy drop (equivalent to setting allowed_accuracy_drop as None):
-
-AIMET allows a user to save intermediate states for computation of the Pareto list. Therefore, if a user computes a Pareto
-list corresponding to an accuracy drop of None, they can view the complete profile of how model accuracy will vary as bit-ops vary.
-
-Thereafter, a user can visualize the Pareto curve plot and choose an optimal point for accuracy. The algorithm can be re-run with
-the new accuracy drop to get a sim model with the required accuracy.
+1: Choosing a very high accuracy drop (equivalent to setting allowed_accuracy_drop to None)
+   AIMET enables you to save intermediate states for computation of the Pareto list. Computing a Pareto list corresponding to an accuracy drop of None generates the complete profile of model accuracy vs. bit-ops. You can then visualize the Pareto curve plot, choose an optimal point for accuracy, and re-run the algorithm with the new accuracy drop to get a sim model with the required accuracy.
 
-.. note::
-
-    The Pareto list is not modified during the second run.
+   .. note::
+
+      The Pareto list is not modified during the second run.
 
-2) Choosing a lower accuracy drop and then continuing to compute pareto list from this point if more accuracy drop is acceptable:
-
-To enable this a user can use the clean_start parameter in the API. If clean_start is set to False then the Pareto list will
-start computation from the last point where it left off.
+2: Choosing a lower accuracy drop and then continuing to compute the Pareto list
+   Use this option if more accuracy drop is acceptable. Setting the ``clean_start`` parameter in the API to False causes Pareto list computation to resume from the point where it left off.
 
-.. note::
-
-    - It is recommended to set the clean_start parameter to False to use cached results for both use cases.
-    - If the model or candidate bit-widths change, the user needs to do a clean start.
+   .. note::
+
+      - We recommend that you set ``clean_start`` to False to use cached results for both use cases.
+      - If the model or candidate bit widths change, you must do a clean start.
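+
+The following sketch illustrates both use cases, reusing the assumed ``choose_mixed_precision``
+entry point and arguments from the earlier sketch:
+
+.. code-block:: python
+
+    # Use case 1: profile the complete accuracy-vs-bit-ops trade-off first.
+    choose_mixed_precision(
+        sim,
+        dummy_input,
+        candidates=candidates,
+        eval_callback_for_phase1=eval_callback_phase1,
+        eval_callback_for_phase2=eval_callback_phase2,
+        allowed_accuracy_drop=None,    # None computes the full Pareto list
+        results_dir="./amp_results",
+        clean_start=True,
+        forward_pass_callback=forward_pass_callback,
+    )
+
+    # Inspect the saved Pareto plot, choose a tolerable drop, then re-run.
+    # Use case 2: clean_start=False resumes from the cached results, so the
+    # Pareto list is not recomputed.
+    choose_mixed_precision(
+        sim,
+        dummy_input,
+        candidates=candidates,
+        eval_callback_for_phase1=eval_callback_phase1,
+        eval_callback_for_phase2=eval_callback_phase2,
+        allowed_accuracy_drop=0.01,    # now target a specific drop
+        results_dir="./amp_results",
+        clean_start=False,
+        forward_pass_callback=forward_pass_callback,
+    )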
 
 Workflow
 ========
 
-Code example
-------------
+Procedure
+---------
 
 Step 1
 ~~~~~~
 
+Setting up the model.
+
 .. tab-set::
     :sync-group: platform
 
     .. tab-item:: PyTorch
         :sync: torch
 
-        **Required imports**
+        **Import packages**
 
         .. literalinclude:: ../../legacy/torch_code_examples/mixed_precision.py
             :language: python
@@ -155,7 +146,7 @@ Step 1
     .. tab-item:: TensorFlow
         :sync: tf
 
-        **Required imports**
+        **Import packages**
 
         .. literalinclude:: ../../legacy/keras_code_examples/mixed_precision.py
             :language: python
@@ -172,14 +163,14 @@ Step 1
     .. tab-item:: ONNX
         :sync: onnx
 
-        **Required imports**
+        **Import packages**
 
         .. literalinclude:: ../../legacy/onnx_code_examples/mixed_precision.py
            :language: python
            :start-after: # Step 0. Import statements
            :end-before: # End step 0
 
-        **Instantiate a PyTorch model, convert to ONNX graph, define forward_pass and evaluation callbacks**
+        **Instantiate a PyTorch model, convert to an ONNX graph, define forward_pass and evaluation callbacks**
 
         .. literalinclude:: ../../legacy/onnx_code_examples/mixed_precision.py
            :language: python
@@ -189,6 +180,8 @@ Step 1
 Step 2
 ~~~~~~
 
+Quantizing the model.
+
 .. tab-set::
     :sync-group: platform
 
diff --git a/Docs/featureguide/mixed precision/index.rst b/Docs/featureguide/mixed precision/index.rst
index 1ca032aec25..54fae844e87 100644
--- a/Docs/featureguide/mixed precision/index.rst
+++ b/Docs/featureguide/mixed precision/index.rst
@@ -4,20 +4,14 @@
 Mixed precision
 ###############
 
-Quantization is a technique to improve the latency by running Deep Learning models in lower precision when
-compared to full-precision floating point. Even though quantization helps achieve improved latency, store the model with
-less memory and consume less power to run the models, it comes at a cost of reduced accuracy when compared to running
-the model in Full Precision. The loss in accuracy is more pronounced as we run the model in lower bitwidths.
-Mixed-Precision helps bridge the accuracy gap of quantized model when compared to floating point accuracy. In mixed
-precision, different layers in the model are run in different precisions based on their sensitivity thereby getting the
-benefit of higher accuracy but keeping the model size to be lower compared to full-precision floating point.
+Quantization improves latency, reduces memory use, and consumes less power, but it comes at the cost of reduced accuracy compared to full-precision floating point. The accuracy loss is more pronounced at lower bit widths. Mixed precision helps bridge this gap: sensitive layers in the model are run at higher precision, recovering accuracy while keeping the model smaller than full-precision floating point.
 
-Mixed precision in AIMET currently follows the following steps,
+Using mixed precision in AIMET follows these steps:
 
-* Create the QuantSim object with a base precision
-* Set the model to run in mixed precision by changing the bitwidth of relevant activation and param quantizers
-* Calibrate and simulate the accuracy of the mixed precision model
-* Export the artifacts which can be used by backend tools like QNN to run the model in mixed precision
+1. Create a quantization simulation (QuantSim) object with a base precision.
+2. Run the model in mixed precision by changing the bit width of selected activation and parameter quantizers.
+3. Calibrate and simulate the accuracy of the mixed precision model.
+4. Export configuration artifacts to create the mixed-precision model.
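+
+In the PyTorch workflow, these steps have roughly the following shape. This is a minimal sketch under
+assumed names and APIs (``MyModel``, the calibration callback, and the quantizer attribute access are
+illustrative); the pages linked below show the exact APIs for each step.
+
+.. code-block:: python
+
+    import torch
+    from aimet_torch.quantsim import QuantizationSimModel
+
+    model = MyModel().eval()                   # assumed user-defined model
+    dummy_input = torch.randn(1, 3, 224, 224)
+
+    # 1. Create the QuantSim object with a base precision (for example, INT8).
+    sim = QuantizationSimModel(model, dummy_input=dummy_input,
+                               default_param_bw=8, default_output_bw=8)
+
+    # 2. Move selected quantizers to a higher precision, manually or via AMP
+    #    (illustrative attribute access; see the linked pages for the real APIs).
+    sim.model.conv1.param_quantizers["weight"].bitwidth = 16
+
+    # 3. Calibrate and simulate the accuracy of the mixed-precision model.
+    sim.compute_encodings(forward_pass_callback, forward_pass_callback_args=None)
+
+    # 4. Export the artifacts for the target runtime.
+    sim.export(path="./export", filename_prefix="mixed_precision_model",
+               dummy_input=dummy_input)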
 
 .. toctree::
     :hidden:
 
@@ -25,14 +19,15 @@ Mixed precision in AIMET currently follows the following steps,
     Manual mixed precision
     Automatic mixed precision
 
-:ref:`Manual mixed precision `
-------------------------------------------------
+AIMET offers two methods for creating a mixed-precision model: a manual mixed-precision configurator and automatic mixed precision.
 
-Manual mixed precision (MMP) allows to set different precision levels (bit-width) to layers
+Manual mixed precision
+----------------------
+
+:ref:`Manual mixed precision ` (MMP) enables you to set different precision levels (bit widths) for layers
 that are sensitive to quantization.
 
-:ref:`Automatic mixed precision `
----------------------------------------------------
+Automatic mixed precision
+-------------------------
 
-Auto mixed precision (AMP) will automatically find a minimal set of layers that need to
-run on higher precision, to get to the desired quantized accuracy.
+:ref:`Automatic mixed precision ` (AMP) automatically finds a minimal set of layers that require higher precision to achieve a desired quantized accuracy.
diff --git a/Docs/featureguide/mixed precision/mmp.rst b/Docs/featureguide/mixed precision/mmp.rst
index 6fa9e94d14d..83acd9f09ef 100644
--- a/Docs/featureguide/mixed precision/mmp.rst
+++ b/Docs/featureguide/mixed precision/mmp.rst
@@ -7,23 +7,25 @@ Manual mixed precision
 Context
 =======
 
-To set the model in mixed precision, AIMET user would have to find the correct quantizer(s) and change to the new
-settings. This requires complex graph traversals which are error prone. Manual Mixed Precision (MMP) Configurator hides
-this issue by providing easy to use APIs to configure the model in mixed precision. User can change the precision of a
-layer by directly specifying the layer and the intended precision. User would also get a report to analyze how it was achieved.
+To use mixed precision effectively, you must find the correct quantizers to run at higher precision. This requires complex, error-prone graph traversals. The AIMET manual mixed precision (MMP) configurator hides this complexity by providing easy-to-use APIs to configure the model in mixed precision. You can change the precision of a layer by directly specifying the layer and the intended precision. The MMP configurator also generates a report showing how the mixed precision was achieved.
 
-MMP configurator provides the following mechanisms to change the precision in a model
+The MMP configurator enables you to change the precision of the following within a model:
 
-* Change the precision of a leaf layer
-* Change the precision of a non-leaf layer (layer composed of multiple leaf layers)
-* Change the precision of all the layers in the model of a certain type
-* Change the precision of model input tensors (or only a subset of input tensors)
-* Change the precision of model output tensors (or only a subset of output tensors)
+* A leaf layer
+* A non-leaf layer (a layer composed of multiple leaf layers)
+* All layers of a certain type
+* Model input tensors or a subset of input tensors
+* Model output tensors or a subset of output tensors
 
 Workflow
 ========
 
+Prerequisites
+-------------
+
+Manual mixed precision is supported only on PyTorch models.
+
 Setup
 -----
 
@@ -38,19 +40,30 @@ Setup
         :start-after: [setup]
         :end-before: [set_precision_leaf]
 
-MMP API options
----------------
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
 
-MMP provides the following APIs to change the precision. The APIs can be called in any order. But, in case of conflicts, latest request will triumph the older request.
+
+Step 1: Applying MMP API options
+--------------------------------
 
 .. note::
 
-    The requests are processed using the leaf layers in the model
+    All requests are processed using the leaf layers in the model.
+
+MMP provides the following APIs to change layer precision. The APIs can be called in any order. In case of conflicts, the latest request overrides an older one. For example:
 
-* If one of the below APIs is called multiple times for the same layer but with a different precision in each of those calls, only the latest one would be serviced
-* This rule holds good even if the requests are from two different APIs ie if user calls a non-leaf layer (L1) with precision (P1) and a leaf layer inside L1 (L2) with precision (P2). This would be serviced by setting all the layers in L1 at P1 precision, except layer L2 which would be set at P2 precision.
+* If one of the following APIs is called multiple times but with a different precision for the same layer, only the latest call is serviced.
+* The last request takes precedence even if the requests are from two different APIs. For example, say you call a non-leaf layer L1 with precision P1 and then a leaf layer L2, inside L1, with precision P2. This sets all the layers in L1 to precision P1, except layer L2, which is set to P2 (see the sketch following this list).
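+
+The following sketch illustrates these precedence rules. The configurator class, import path, and
+call signatures are assumptions for illustration; the exact API is shown in the sections below.
+
+.. code-block:: python
+
+    # Hedged sketch of MMP request precedence. `sim` is the QuantSim object from
+    # the Setup step; the layer names are hypothetical.
+    from aimet_torch.v2.mixed_precision import MixedPrecisionConfigurator  # assumed path
+
+    mp = MixedPrecisionConfigurator(sim)
+
+    # The same layer requested twice: only the later (INT16) request is serviced.
+    mp.set_precision(sim.model.conv1, activation="int8", param={"weight": "int8"})
+    mp.set_precision(sim.model.conv1, activation="int16", param={"weight": "int16"})
+
+    # Requests from two different APIs: the non-leaf block is set to INT8, then one
+    # leaf inside it is raised to INT16. Every layer in `block1` ends up at INT8
+    # except `block1.conv2`, which ends up at INT16.
+    mp.set_precision(sim.model.block1, activation="int8", param={"weight": "int8"})
+    mp.set_precision(sim.model.block1.conv2, activation="int16", param={"weight": "int16"})
+
+    mp.apply()  # all requests are resolved and realized together in step 2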
 
 Set precision of a leaf layer
------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. tab-set::
     :sync-group: platform
 
@@ -63,9 +76,19 @@ Set precision of a leaf layer
             :language: python
             :start-after: [set_precision_leaf]
             :end-before: [set_precision_non_leaf]
 
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+
 
 Set precision of a non-leaf layer
----------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. tab-set::
     :sync-group: platform
 
@@ -78,9 +101,19 @@ Set precision of a non-leaf layer
             :language: python
             :start-after: [set_precision_non_leaf]
             :end-before: [set_precision_type]
 
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+
 
 Set precision based on layer type
---------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. tab-set::
     :sync-group: platform
 
@@ -93,8 +126,19 @@ Set precision based on layer type
             :language: python
             :start-after: [set_precision_type]
             :end-before: [set_precision_model_input]
 
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+
+
 Set model input precision
--------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. tab-set::
     :sync-group: platform
 
@@ -107,10 +151,20 @@ Set model input precision
             :language: python
             :start-after: [set_precision_model_input]
             :end-before: [set_precision_model_output]
 
-* Do note that if a model has more than one input tensor (say the structure is [In1, In2]), but only one of them (say In2) needs to be configured to a new precision (say P1), user can achieve it by setting ``activation=[None, P1]`` in the above API
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+
+If a model has more than one input tensor (for example, the structure is [In1, In2]), you can set just one of them (say In2) to a new precision (say P1) by setting ``activation=[None, P1]`` in this API.
 
 Set model output precision
---------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. tab-set::
     :sync-group: platform
 
@@ -123,12 +177,23 @@ Set model output precision
             :language: python
             :start-after: [set_precision_model_output]
             :end-before: [apply]
 
-* Do note that if a model has more than one output tensor (say the structure is [Out1, Out2, Out3]), but only one of them (say Out2) needs to be configured to a new precision (say P1), user can achieve it by setting ``activation=[None, P1, None]`` in the above API
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
 
-Apply the profile
------------------
+    .. tab-item:: ONNX
+        :sync: onnx
 
-All the above `set precision` family of calls would be processed at once when the below ``apply(...)`` API is called
+        Not supported.
+
+If a model has more than one output tensor (for example, the structure is [Out1, Out2, Out3]), you can set just one of them (say Out2) to a new precision (say P1) by setting ``activation=[None, P1, None]`` in this API.
+
+
+Step 2: Applying the profile
+----------------------------
+
+All of the ``set_precision`` family of calls from step 1 are processed at once when the following ``apply(...)`` API is called.
 
 .. tab-set::
     :sync-group: platform
 
@@ -140,8 +205,18 @@ All the above `set precision` family of calls would be processed at once when th
             :language: python
             :start-after: [apply]
 
-.. note::
-    The above call would generate a report detailing how a user's request was inferred, propagated to other layers and realized eventually
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+
+
+The ``apply`` call generates a report detailing how the request was inferred, propagated to other layers, and eventually realized.
 
 API
 ===
@@ -155,3 +230,14 @@ API
 .. include:: ../../apiref/torch/mp.rst
     :start-after: # start-after mmp
     :end-before: # end-before mmp
+
+    .. tab-item:: TensorFlow
+        :sync: tf
+
+        Not supported.
+
+    .. tab-item:: ONNX
+        :sync: onnx
+
+        Not supported.
+