Commit 717c39e

Example scripts revamp (#615)
1. Restructured examples folder
   - Organized examples into domain-specific subdirectories: `text_generation/`, `image_text_to_text/`, `embeddings/`, `audio/`, `peft/`, `performance/`
   - Each subdirectory contains focused examples with dedicated READMEs
2. New CONTRIBUTING.md
   - Guidelines for when to add new examples
   - Code templates and file requirements
   - Complete Git workflow with DCO sign-off
   - Pre-commit checks and PR submission process

Signed-off-by: Rishin Raj <[email protected]>
Co-authored-by: Abukhoyer Shaik <[email protected]>
1 parent 30c334b commit 717c39e

77 files changed (+3047, −727 lines)


QEfficient/cloud/execute.py

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ def main(
         "--prompts_txt_file_path",
         "--prompts-txt-file-path",
         type=str,
-        help="File path for taking input prompts from txt file, sample prompts.txt file present in examples folder",
+        help="File path for taking input prompts from txt file, sample prompts.txt file present in examples/sample_prompts folder",
     )
     parser.add_argument("--generation_len", "--generation-len", type=int, help="Number of tokens to generate")
     parser.add_argument(

QEfficient/cloud/infer.py

Lines changed: 1 addition & 1 deletion
@@ -390,7 +390,7 @@ def main(
         "--prompts_txt_file_path",
         "--prompts-txt-file-path",
         type=str,
-        help="File path for taking input prompts from txt file, sample prompts.txt file present in examples folder",
+        help="File path for taking input prompts from txt file, sample prompts.txt file present in examples/sample_prompts folder",
     )
     parser.add_argument("--generation_len", "--generation-len", type=int, help="Number of tokens to generate")
     parser.add_argument(
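Both CLI parsers register the flag under two spellings; argparse maps the aliases to a single destination derived from the first long option. A minimal, self-contained sketch of the pattern (illustration only, not the repository's full parser; the one-prompt-per-line file format is an assumption):

```python
import argparse

# Sketch of the dual-spelling flag used in QEfficient/cloud/execute.py
# and QEfficient/cloud/infer.py; both spellings fill the same dest.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--prompts_txt_file_path",
    "--prompts-txt-file-path",
    type=str,
    help="File path for taking input prompts from txt file, sample "
    "prompts.txt file present in examples/sample_prompts folder",
)
parser.add_argument(
    "--generation_len", "--generation-len", type=int,
    help="Number of tokens to generate",
)

args = parser.parse_args(
    ["--prompts-txt-file-path", "examples/sample_prompts/prompts.txt"]
)

# Assumption: the prompts file carries one prompt per line.
if args.prompts_txt_file_path:
    with open(args.prompts_txt_file_path) as f:
        prompts = [line.strip() for line in f if line.strip()]
```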

docs/source/quick_start.md

Lines changed: 2 additions & 2 deletions
@@ -125,10 +125,10 @@ You can pass input prompts in single string but separate with pipe (|) symbol".
 python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompt "My name is|The flat earth theory is the belief that|The sun rises from" --mxfp6 --mos 1 --aic_enable_depth_first
 ```

-You can also pass path of txt file with input prompts when you want to run inference on lot of prompts, Example below, sample txt file(prompts.txt) is present in examples folder.
+You can also pass path of txt file with input prompts when you want to run inference on lot of prompts, Example below, sample txt file(prompts.txt) is present in examples/sample_prompts folder.

 ```bash
-python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first
+python -m QEfficient.cloud.infer --model_name gpt2 --batch_size 3 --prompt_len 32 --ctx_len 128 --num_cores 16 --device_group [0] --prompts_txt_file_path examples/sample_prompts/prompts.txt --mxfp6 --mos 1 --aic_enable_depth_first
 ```
 **QNN CLI Inference Command**

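The same flow is available through the library's Python API. A minimal sketch following the README pattern, reading prompts from the relocated sample file (the `mxfp6_matmul` compile argument and exact signatures are assumptions and may differ by version):

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

# Sketch of the Python-API counterpart of the CLI command above,
# following the README pattern; compile kwargs are assumptions.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")
model.compile(num_cores=16, mxfp6_matmul=True)  # mirrors --num_cores / --mxfp6

tokenizer = AutoTokenizer.from_pretrained("gpt2")
with open("examples/sample_prompts/prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

model.generate(prompts=prompts, tokenizer=tokenizer)
```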

docs/source/release_docs.md

Lines changed: 3 additions & 3 deletions
@@ -13,7 +13,7 @@ Welcome to the official release of **Efficient Transformer Library v1.20.0**! Th
   - Text & Image+Text support
   - Chunk attention, Single/Dual QPC support
   - Multi-image prompts enabled via VLLM interface
-  - [Llama4 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/llama4_example.py)
+  - [Llama4 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/llama_vision/single_image.py)

 - **Grok-1**
   - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
@@ -22,7 +22,7 @@ Welcome to the official release of **Efficient Transformer Library v1.20.0**! Th
   - Executable via [`QEFFAutoModelForImageTextToText`](#QEFFAutoModelForImageTextToText)
   - Text & Image+Text support
   - Sliding window support
-  - [Gemma3 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/gemma3_example/gemma3_mm.py)
+  - [Gemma3 Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/models/gemma_vision/inference.py)


 - **SwiftKV (Llama-3.1-SwiftKV-8B-Instruct)**
@@ -32,7 +32,7 @@ Welcome to the official release of **Efficient Transformer Library v1.20.0**! Th
 - **GGUF Models**
   - Executable via [`QEffAutoModelForCausalLM`](#QEffAutoModelForCausalLM)
   - Execution support (non-quantized)
-  - [Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/basic_gguf_models.py)
+  - [Example Script](https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/gguf_models.py)

 - **FP8 Compressed Quantization**
   - Support for [`Llama-3.3-70B-Instruct-FP8-Dynamic`](https://huggingface.co/Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic)

docs/source/supported_features.rst

Lines changed: 13 additions & 13 deletions
@@ -6,46 +6,48 @@ Supported Features

    * - Feature
      - Impact
+   * - `Compute Context Length (CCL) <https://github.com/quic/efficient-transformers/blob/main/examples/performance/compute_context_length/README.md>`_
+     - Optimizes inference by using different context lengths during prefill and decode phases, reducing memory footprint and computation for shorter sequences while maintaining support for longer contexts. Supports both text-only and vision-language models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/compute_context_length/basic_inference.py>`_ for more **details**.
    * - Sentence embedding, Flexible Pooling configuration and compilation with multiple sequence lengths
-     - Supports standard/custom pooling with AI 100 acceleration and sentence embedding. Enables efficient sentence embeddings via Efficient-Transformers. Compile with one or multiple seq_len; optimal graph auto-selected at runtime. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/embedding_model.py>`_ for more **details**.
+     - Supports standard/custom pooling with AI 100 acceleration and sentence embedding. Enables efficient sentence embeddings via Efficient-Transformers. Compile with one or multiple seq_len; optimal graph auto-selected at runtime. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/embeddings/sentence_embeddings.py>`_ for more **details**.
    * - `SpD, multiprojection heads <https://quic.github.io/efficient-transformers/source/quick_start.html#draft-based-speculative-decoding>`_
-     - Implemented post-attention hidden size projections to speculate tokens ahead of the base model. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/multiprojs_spd_inference.py>`_ for more **details**.
+     - Implemented post-attention hidden size projections to speculate tokens ahead of the base model. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/multi_projection.py>`_ for more **details**.
    * - `QNN Compilation support <https://github.com/quic/efficient-transformers/pull/374>`_
      - Enabled for AutoModel classes QNN compilation capabilities for multi-models, embedding models and causal models.
    * - `Disaggregated serving <https://github.com/quic/efficient-transformers/pull/365>`_
      - It support for separate prefill and decode compilation for encoder (vision) and language models.
    * - `GGUF model execution <https://github.com/quic/efficient-transformers/pull/368>`_
-     - Supported GGUF model execution (without quantized weights). Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/basic_gguf_models.py>`_ for more **details**.
+     - Supported GGUF model execution (without quantized weights). Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/text_generation/gguf_models.py>`_ for more **details**.
    * - Replication of KV
      - Enabled FP8 model support on `replicate_kv_heads script <https://github.com/quic/efficient-transformers/tree/main/scripts/replicate_kv_head>`_.
    * - `gradient checkpointing <https://github.com/quic/efficient-transformers/pull/338>`_
      - Supports gradient checkpointing in the finetuning script
    * - Swift KV `Snowflake/Llama-3.1-SwiftKV-8B-Instruct <https://huggingface.co/Snowflake/Llama-3.1-SwiftKV-8B-Instruct>`_
      - Reduces computational overhead during inference by optimizing key-value pair processing, leading to improved throughput. Support for both `continuous and non-continuous batching execution <https://github.com/quic/efficient-transformers/pull/367>`_ in SwiftKV
    * - :ref:`Vision Language Model <QEFFAutoModelForImageTextToText>`
-     - Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text_inference.py>`_ for more **details**.
+     - Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text/basic_vlm_inference.py>`_ for more **details**.
    * - :ref:`Speech Sequence to Sequence Model <QEFFAutoModelForSpeechSeq2Seq>`
-     - Provides support for the QEFFAutoModelForSpeechSeq2Seq Facilitates speech-to-text sequence models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/speech_to_text/run_whisper_speech_to_text.py>`_ for more **details**.
+     - Provides support for the QEFFAutoModelForSpeechSeq2Seq Facilitates speech-to-text sequence models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/audio/speech_to_text.py>`_ for more **details**.
    * - Support for FP8 Execution
      - Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks.
    * - Prefill caching
      - Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency.
    * - On Device Sampling
      - Enables sampling operations to be executed directly on the QAIC device rather than the host CPU for QEffForCausalLM models. This enhancement significantly reduces host-device communication overhead and improves inference throughput and scalability. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/on_device_sampling.py>`_ for more **details**.
    * - Prompt-Lookup Decoding
-     - Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/pld_spd_inference.py>`_ for more **details**.
+     - Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/prompt_lookup.py>`_ for more **details**.
    * - :ref:`PEFT LoRA support <QEffAutoPeftModelForCausalLM>`
-     - Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft_models.py>`_ for more **details**.
+     - Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft/single_adapter.py>`_ for more **details**.
    * - :ref:`QNN support <id-qnn-compilation-via-python-api>`
      - Enables compilation using QNN SDK, making Qeff adaptable for various backends in the future.
    * - :ref:`Embedding model support <QEFFAutoModel>`
      - Facilitates the generation of vector embeddings for retrieval tasks.
    * - :ref:`Speculative Decoding <id-draft-based-speculative-decoding>`
-     - Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/draft_spd_inference.py>`_ for more **details**.
+     - Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/speculative_decoding/draft_based.py>`_ for more **details**.
    * - :ref:`Finite lorax <QEffAutoLoraModelForCausalLM>`
-     - Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/lora_models.py>`_ for more **details**.
+     - Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/peft/multi_adapter.py>`_ for more **details**.
    * - Python and CPP Inferencing API support
-     - Provides flexibility while running inference with Qeff and enabling integration with various applications and improving accessibility for developers. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/cpp_execution/text_inference_using_cpp.py>`_ for more **details**.
+     - Provides flexibility while running inference with Qeff and enabling integration with various applications and improving accessibility for developers. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/examples/performance/cpp_execution/text_inference_cpp.py>`_ for more **details**.
    * - :ref:`Continuous batching <id-continuous-batching>`
      - Optimizes throughput and latency by dynamically batching requests, ensuring efficient use of computational resources.
    * - AWQ and GPTQ support
@@ -56,7 +58,5 @@ Supported Features
      - A script for computing the perplexity of a model, allowing for the evaluation of model performance and comparison across different models and datasets. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/scripts/perplexity_computation/calculate_perplexity.py>`_ for more **details**.
    * - KV Heads Replication Script
      - A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer `sample script <https://github.com/quic/efficient-transformers/blob/main/scripts/replicate_kv_head/replicate_kv_heads.py>`_ for more **details**.
-   * - Context Length Specializations (upcoming)
-     - Increases the maximum context length that models can handle, allowing for better performance on tasks requiring long sequences of text.
    * - Block Attention (in progress)
-     - Reduces inference latency and computational cost by dividing context into blocks and reusing key-value states, particularly useful in RAG.
+     - Reduces inference latency and computational cost by dividing context into blocks and reusing key-value states, particularly useful in RAG.
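The sentence-embedding row above describes compiling with one or several sequence lengths and flexible pooling. For orientation, a minimal sketch of that flow based on the library's documented pattern (the `pooling` and `seq_len` keyword arguments, and the `generate(inputs)` call, are assumptions if your version differs):

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModel

# Sketch of the sentence-embedding flow from the feature table above,
# following the pattern in the library docs; kwargs are assumptions.
model = QEFFAutoModel.from_pretrained(
    "sentence-transformers/all-MiniLM-L6-v2",
    pooling="mean",  # standard pooling; custom pooling is also described
)
# Compile once with multiple sequence lengths; the optimal graph is
# auto-selected at runtime, per the table entry above.
model.compile(num_cores=16, seq_len=[32, 64])

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
inputs = tokenizer("My name is", return_tensors="pt")
embeddings = model.generate(inputs)
```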
