Heterogeneously run the LLaMA model on both the QNN and XNNPACK backends. #13629
@yujiaoliang For QNN specifically, you can instruct the QNN partitioner to skip specific node IDs or operators, which allows them to fall back to XNNPACK. See the QNN partitioner args here: https://www.internalfb.com/code/fbsource/[3369a2d3a668]/fbcode/executorch/backends/qualcomm/partition/qnn_partitioner.py?lines=135. You can then pass both the QnnPartitioner and the XnnpackPartitioner to to_edge_transform_and_lower; the second partitioner acts as a fallback:

```python
to_edge_transform_and_lower(
    ep,
    partitioner=[qnn_partitioner, xnnpack_partitioner],
)
```

You can also provide a custom partitioner for advanced use cases, but that requires a bit of coding. There is an example in https://docs.pytorch.org/executorch/main/compiler-delegate-and-partitioner.html#common-questions under "5. Can we delegate to multiple backends?".
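For reference, here is a minimal sketch of that flow. The `skip_node_op_set` parameter name, the op-name key, and the compiler-spec setup are assumptions based on the linked partitioner args and the Qualcomm backend docs; check the signatures in your ExecuTorch version.

```python
from executorch.backends.qualcomm.partition.qnn_partitioner import QnnPartitioner
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# compiler_specs describe the target Qualcomm SoC; build them with the
# Qualcomm backend's compiler-spec helper for your device (omitted here).
compiler_specs = ...

# Ask the QNN partitioner to leave these ops untouched so the next
# partitioner in the list (XNNPACK) can claim them.
# NOTE: parameter name and op-name format are assumptions; see the
# partitioner args linked above.
qnn_partitioner = QnnPartitioner(
    compiler_specs,
    skip_node_op_set={"aten.linear.default"},
)
xnnpack_partitioner = XnnpackPartitioner()

# ep is the exported program from torch.export.export(model, example_inputs).
edge = to_edge_transform_and_lower(
    ep,
    partitioner=[qnn_partitioner, xnnpack_partitioner],
)
executorch_program = edge.to_executorch()
```

Order matters here: QNN claims nodes first, and anything it skips or cannot handle is picked up by the XNNPACK partitioner as the fallback.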
I’m planning to deploy the quantized LLaMA 3.2-3B model on QNN and run some of its linear layers on XNNPACK. Would this be possible, and is this kind of setup supported at the moment?