High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation
kDiT is a high-performance inference framework built specifically for Diffusion Transformers (DiT), supporting video generation (T2V/I2V) and image generation (T2I) tasks. The framework provides a rich set of optimization techniques and flexible configuration options, enabling efficient execution of large-scale DiT models in single- and multi-GPU environments.
- 🚀 High-Performance Inference: FP8 quantization, QKV Fuse, Torch Compile, and various attention optimizations
- 🎯 Multiple Attention Backends: SLA Attention, Flash Attention, Sage Attention, Radial Sage Attention, Torch SDPA
- 🎬 Multi-Modal Generation: Text-to-Video (T2V), Image-to-Video (I2V), Video Controllable Editing (Vace), Text-to-Image (T2I)
- 💾 Smart Caching: Built-in caching strategies (DBCache, EasyCache, MagCache, TeaCache, CustomStepCache, HybridCache)
- 🔧 Flexible Configuration: LoRA support, multiple samplers (Euler, UniPC, DPM++), custom sigma scheduling
- 🌐 Distributed Support: Single-GPU, multi-GPU (torchrun), Ray distributed inference, Model Pool management
- 🔌 ComfyUI Integration: ComfyUI node support (standalone submodule) for visual workflow design
- 🖥️ Multi-Platform Support: GPU, NPU, XPU (WIP)
| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Turbo Diffusion | Image-to-Video | 14B | I2V | ✅ |
| Wan2.2-T2V | Text-to-Video | 5B/14B | T2V | ✅ |
| Wan2.2-I2V | Image-to-Video | 14B | I2V | ✅ |
| Wan2.1-Vace | Video Controllable Editing | 14B | Vace | ✅ |
| Model | Type | Parameters | Tasks | Status |
|---|---|---|---|---|
| Qwen-Image | Text-to-Image | 20B | T2I | ✅ |
| Qwen-Image Edit | Image Editing | 20B | Image Edit | ✅ |
We are actively working on Dockerfiles. Stay tuned!
- Python: >= 3.10, < 4.0
- PyTorch: >= 2.0
- GPU Environment:
- CUDA >= 12.8
- Recommended: NVIDIA GPUs
- NPU Environment:
- CANN >= 8.0
- torch_npu adapter
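To sanity-check the version requirements above, a dotted version string can be compared numerically rather than lexically (lexical comparison would rank "3.9" above "3.10"). The helper below is an illustrative sketch, not part of kDiT:

```python
# Illustrative version check for the requirements above.  The helper name
# `meets_requirement` is NOT a kDiT API -- it is just a sketch.
def meets_requirement(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically (e.g. "3.10.12" >= "3.10")."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    a, b = to_tuple(installed), to_tuple(minimum)
    # Pad the shorter tuple with zeros so "3.10" compares like "3.10.0".
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)) >= b + (0,) * (n - len(b))

print(meets_requirement("12.9", "12.8"))   # e.g. checking CUDA >= 12.8
print(meets_requirement("3.9.7", "3.10"))  # Python too old
```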
```bash
# Clone the repository
git clone https://github.com/Tencent/KsanaDiT.git
cd KsanaDiT

# Run the installation script (automatically handles all dependencies)
bash scripts/install_public.sh
```

The installation script automatically detects your hardware environment and installs the appropriate dependencies.
kDiT provides multiple usage methods to meet different scenario requirements:
Run locally through the Python Pipeline API, suitable for scripted batch generation or integration into your own systems:
```python
from kdit import Pipeline

# Create inference pipeline
pipeline = Pipeline.from_models("path/to/model")

# Generate video/image
result = pipeline.generate(prompt, ...)
```

For detailed usage, refer to Quick Start and the examples directory.
kDiT supports usage as ComfyUI custom nodes, providing a visual workflow experience:
```bash
# 1. Clone the kDiT repository
git clone https://github.com/Tencent/KsanaDiT.git

# 2. Enter the kDiT directory and run the install script
cd KsanaDiT
./scripts/install_public.sh
```

During installation, the script interactively prompts you for the ComfyUI installation root directory. After installation, restart ComfyUI and the kDiT nodes will appear in the node list.
For detailed code examples, refer to examples.
**Text-to-Video (T2V)**

```python
import torch
from kdit import Pipeline
from kdit.config import (
    DistributedConfig,
    RuntimeConfig,
    SampleConfig,
)

# Create inference pipeline
pipeline = Pipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    dist_config=DistributedConfig(num_gpus=1),
)

# Generate video
video = pipeline.generate(
    "Street photography, cool girl with headphones skateboarding, New York streets, graffiti wall background",
    sample_config=SampleConfig(steps=40),
    runtime_config=RuntimeConfig(
        seed=1234,
        size=(720, 480),
        frame_num=17,
        return_frames=True,
    ),
)
print(f"Generated video shape: {video.shape}")
```

**Image-to-Video (I2V)**

```python
from kdit import Pipeline
from kdit.config import RuntimeConfig, SampleConfig
from kdit.pipelines.context_builders.wan import WanI2VExtraInputs

pipeline = Pipeline.from_models("path/to/Wan2.2-I2V-A14B")

video = pipeline.generate(
    "Girl gently waves her fan, blows a breath of fairy air, lightning flies from her hand into the sky and thunder begins",
    extra_inputs=WanI2VExtraInputs(start_img_path="input.png"),
    sample_config=SampleConfig(steps=40),
    runtime_config=RuntimeConfig(
        seed=1234,
        size=(512, 512),
        frame_num=17,
    ),
)
```

**Text-to-Image (T2I)**

```python
import torch
from kdit import Pipeline
from kdit.config import (
    ModelConfig,
    RuntimeConfig,
    SampleConfig,
    SolverType,
)

pipeline = Pipeline.from_models(
    "path/to/Qwen-Image",
    model_config=ModelConfig(run_dtype=torch.bfloat16),
)

image = pipeline.generate(
    "A cute orange cat sitting on a windowsill, sunlight streaming through the window onto its fur",
    sample_config=SampleConfig(
        steps=20,
        cfg_scale=4.0,
        solver=SolverType.FLOWMATCH_EULER,
    ),
    runtime_config=RuntimeConfig(
        seed=42,
        size=(1024, 1024),
    ),
)
```

**FP8 Quantization and Attention Backends**

```python
import torch
from kdit import Pipeline
from kdit.config import (
    ModelConfig,
    KsanaAttentionConfig,
    KsanaAttentionBackend,
    KsanaLinearBackend,
)

model_config = ModelConfig(
    run_dtype=torch.float16,
    attention_config=KsanaAttentionConfig(backend=KsanaAttentionBackend.SAGE_ATTN),
    linear_backend=KsanaLinearBackend.FP8_GEMM,
)

pipeline = Pipeline.from_models(
    ("high_noise_fp8.safetensors", "low_noise_fp8.safetensors"),
    model_config=model_config,
)
```

**LoRA Acceleration**

```python
from kdit import Pipeline
from kdit.config import LoraConfig, SampleConfig

pipeline = Pipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    lora_config=LoraConfig("path/to/Wan2.2-Lightning-4steps-lora"),
)

# Fast generation with 4 steps
video = pipeline.generate(
    prompt,
    sample_config=SampleConfig(
        steps=4,
        cfg_scale=1.0,
        sigmas=[1.0, 0.9375, 0.6333, 0.225, 0.0],
    ),
)
```

**Caching Strategies**

```python
from kdit.config.cache_config import (
    DCacheConfig,
    DBCacheConfig,
    HybridCacheConfig,
)

# Use hybrid caching strategy
cache_config = HybridCacheConfig(
    step_cache=DCacheConfig(fast_degree=50),
    block_cache=DBCacheConfig(),
)

video = pipeline.generate(
    prompt,
    cache_config=cache_config,
)
```

**Multi-GPU Inference**

```bash
# Method 1: Using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1,2,3 python your_script.py

# Method 2: Using torchrun
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 your_script.py
```

```python
from kdit import Pipeline
from kdit.config import DistributedConfig

pipeline = Pipeline.from_models(
    model_path,
    dist_config=DistributedConfig(num_gpus=4),
)
```

| Technique | Description | Effect |
|---|---|---|
| FP8 GEMM | FP8 quantized matrix multiplication | Reduced memory, improved speed |
| Torchao FP8 Dynamic | Dynamic FP8 quantization | Adaptive precision, balanced quality and performance |
| QKV Fuse | QKV projection fusion | Reduced memory access, improved throughput |
| torch.compile | Graph compilation optimization | 10-30% end-to-end speedup |
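The idea behind dynamic FP8 quantization can be sketched in a few lines: rescale each tensor so its largest magnitude maps onto the FP8 representable range, then apply the inverse scale after the low-precision operation. The snippet below is a conceptual illustration of that scaling scheme using plain Python floats; it is not kDiT's implementation, and the function names are illustrative:

```python
# Conceptual sketch of dynamic FP8 (E4M3) scaling: map each tensor's largest
# magnitude to the E4M3 maximum (448), compute in the scaled domain, and
# dequantize with the inverse scale.  NOT kDiT's actual kernels.
E4M3_MAX = 448.0

def dynamic_fp8_scale(values):
    """Per-tensor scale chosen from the current data (hence 'dynamic')."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / amax if amax > 0 else 1.0

def fake_quantize(values):
    """Quantize-dequantize round trip; real kernels keep the scaled values."""
    scale = dynamic_fp8_scale(values)
    # Clamping stands in for the limited FP8 range; the rounding error from
    # the 3-bit mantissa is omitted to keep the sketch short.
    quantized = [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]
    return [q / scale for q in quantized], scale

weights = [0.02, -1.5, 0.75]
roundtrip, scale = fake_quantize(weights)
print(scale)  # 448 / 1.5, i.e. about 298.67
```

Because the scale is recomputed from each tensor's actual range, no calibration pass is needed, which is what makes this variant "adaptive" in the table above.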
| Backend | Characteristics | Use Case |
|---|---|---|
| Flash Attention | High performance, memory efficient | General recommendation |
| Sage Attention | Optimized attention computation | Long sequences |
| Sage SLA | Top-k sparse attention | Turbo Diffusion |
| Radial Sage Attention | Radial sparse attention | Very long sequences |
| Torch SDPA | PyTorch native implementation | Compatibility priority |
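All backends in the table compute the same mathematical function, scaled dot-product attention softmax(QK^T / sqrt(d)) V, and differ only in kernel strategy (tiling, sparsity, quantization). A pure-Python reference, using lists of row vectors in place of tensors:

```python
import math

def sdpa(q, k, v):
    """Reference scaled dot-product attention: softmax(QK^T / sqrt(d)) V.
    Every backend in the table above computes this same function; they
    differ only in how the kernel is implemented."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Output row = attention-weighted sum of the value rows.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

# One query attending over two key/value rows.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(sdpa(q, k, v))
```

Sparse backends such as SLA and Radial Sage skip most of the score matrix for long sequences, which is why they appear in the "very long sequences" rows.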
| Strategy | Description | Use Case |
|---|---|---|
| DCache | Step-level caching with degree-based polynomial | General video generation |
| TeaCache | Temporal-aware step-level caching | Video generation optimization |
| MagCache | Adaptive step-level caching | Balanced quality and speed |
| EasyCache | Lightweight step-level caching without pre-prepared parameters | Fast inference with minimal overhead |
| DBCache | Block-level caching | Image generation |
| HybridCache | Step-level + block-level hybrid caching | Maximum acceleration |
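The step-level strategies above share one core idea: when the model's input changes little between consecutive denoising steps, reuse the previously computed residual instead of running the transformer again. The sketch below illustrates that mechanism with a scalar stand-in; the class, threshold, and change metric are illustrative, not kDiT's internals:

```python
# Conceptual sketch of step-level caching: if the input barely moved since
# the last computed step, reuse the cached residual instead of running the
# model.  Class name, threshold, and change metric are illustrative only.
class StepCache:
    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.last_input = None
        self.cached_residual = None
        self.hits = 0

    def residual(self, x, compute):
        """Return compute(x), or the cached value if x barely changed."""
        if self.last_input is not None and self.cached_residual is not None:
            change = abs(x - self.last_input) / (abs(self.last_input) + 1e-8)
            if change < self.threshold:
                self.hits += 1
                return self.cached_residual   # cache hit: skip the model
        self.last_input = x
        self.cached_residual = compute(x)     # cache miss: run and store
        return self.cached_residual

cache = StepCache(threshold=0.1)
expensive = lambda x: x * 2                   # stand-in for the DiT forward
outputs = [cache.residual(x, expensive) for x in [1.0, 1.01, 1.5, 1.52]]
print(outputs, cache.hits)                    # two of four steps are skipped
```

Block-level strategies (DBCache) apply the same test per transformer block rather than per step, and HybridCache layers the two.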
| Sampler | Description | Use Case |
|---|---|---|
| Euler | Fast sampling | 4-8 step inference |
| UniPC | High-quality sampling | 20-40 step inference |
| DPM++ | Efficient multi-step sampling | General purpose |
| Turbo Diffusion | Ultra-fast sampling | 4-step inference |
| FlowMatch Euler | Flow matching sampling | Image generation |
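Euler-style samplers in the table integrate a velocity field across the sigma schedule with one explicit update per step, x ← x + v(x, σ)·Δσ. The toy example below shows that integration loop with a made-up linear velocity field; a real sampler would query the DiT model for the velocity, and none of these names are kDiT APIs:

```python
# Conceptual sketch of Euler sampling over a sigma schedule: one explicit
# Euler update per step.  The toy velocity field below is made up; a real
# sampler queries the DiT model at each step.
def euler_sample(x, sigmas, velocity):
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        dt = s_next - s_cur          # schedule is decreasing, so dt < 0
        x = x + velocity(x, s_cur) * dt
    return x

sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]   # 4 steps
toy_velocity = lambda x, s: x           # flow whose exact solution decays to 0
x0 = 8.0
xT = euler_sample(x0, sigmas, toy_velocity)
print(xT)                               # each step multiplies x by 0.75
```

This is why the sigma list matters in few-step setups such as the Lightning LoRA example: with only 4 Euler steps, where the sigmas are placed determines how the integration error is distributed.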
```bash
# Log level: debug/info/warn/error
export KSANA_LOGGER_LEVEL=info
```
The framework supports model parameter configuration via YAML files, located in the `kdit/settings/` directory:
- `qwen/t2i_20b.yaml` - Qwen image generation model config
- `qwen/edit_20b.yaml` - Qwen image editing model config
- `wan/t2v_14b.yaml` - Wan2.2 T2V model config
- `wan/ti2v_5b.yaml` - Wan2.2 TI2V 5B model config
- `wan/i2v_14b.yaml` - Wan2.2 I2V model config
- `wan/vace_14b.yaml` - Wan2.1 Vace model config
Complete example code is available in the `examples/` directory:
- `examples/local/wan/wan2_2_t2v.py` - Text-to-Video example
- `examples/local/wan/wan2_2_i2v.py` - Image-to-Video example
- `examples/local/wan/wan2_1_vace.py` - Video controllable editing example
- `examples/local/qwen/qwen_image_t2i.py` - Text-to-Image example
- `examples/local/qwen/qwen_image_edit.py` - Image Editing example
We have comprehensive test coverage. The tests are intended for developers; they are currently time-consuming to run, and we are working to streamline them.
```bash
# Run all tests
pytest tests/

# Run specific tests
pytest tests/kdit/pipelines/wan2_2_t2v_test.py

# Run GPU tests
bash scripts/ci_tests/ci_kdit_gpus.sh
```

We welcome community contributions! Before submitting a PR, please ensure:
- Code passes all tests
- Follows project code style (using the `git commit` hook)
- Includes necessary documentation and comments
- Updates relevant README and examples
```bash
# Install development dependencies
pip install -e ".[dev]"

# Run code style checks
pre-commit run --all-files

# Run tests
pytest tests/
```

For a detailed list of changes in each version, see the CHANGELOG.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
This project benefits from the following excellent open-source projects:
- Wan-Video - Wan2.2 video generation model
- ComfyUI-WanVideoWrapper - ComfyUI integration reference
- FastVideo - Video generation optimization techniques
- Nunchaku - Quantization optimization solutions
- TurboDiffusion - Inference acceleration solutions
- Bug Reports: GitHub Issues
- Feature Requests: GitHub Discussions
- Multi-Platform Support: GPU, NPU, XPU backend support
- Batch Inference: Support for batch size > 1, merged cond/uncond
- Video Editing: Wan2.1 Vace video controllable editing
- Advanced Samplers: DPM++, Turbo Diffusion support
- Performance Optimization: QKV Fuse + Dynamic FP8 optimization
- Memory Optimization: Pin Manager to resolve OOM issues
- Smart Caching: MagCache, TeaCache, EasyCache strategies
- Image Editing: Qwen Image Edit model support
- VAE Parallelism: Multi-GPU VAE decoding
- Monitoring: Inference metrics reporting
- Support for more generation models (Z-Image, Hunyuan, etc.)
- Memory optimization for longer video generation
- Cache strategy performance tuning
- Model quantization toolchain
- XPU full feature support optimization
If this project helps you, please give us a ⭐️ Star!
Made with ❤️ by the kDiT Team