
kDiT

High-Performance DiT (Diffusion Transformer) Inference Framework for Video & Image Generation


English | 简体中文

📖 Introduction

kDiT is a high-performance inference framework designed specifically for Diffusion Transformers (DiT), supporting video generation (T2V/I2V) and image generation (T2I) tasks. The framework provides a rich set of optimization techniques and flexible configuration options, enabling efficient execution of large-scale DiT models in single- and multi-GPU environments.

✨ Key Features

  • 🚀 High-Performance Inference: FP8 quantization, QKV Fuse, Torch Compile, and various attention optimizations
  • 🎯 Multiple Attention Backends: SLA Attention, Flash Attention, Sage Attention, Radial Sage Attention, Torch SDPA
  • 🎬 Multi-Modal Generation: Text-to-Video (T2V), Image-to-Video (I2V), Video Controllable Editing (Vace), Text-to-Image (T2I)
  • 💾 Smart Caching: Built-in caching strategies (DBCache, EasyCache, MagCache, TeaCache, CustomStepCache, HybridCache)
  • 🔧 Flexible Configuration: LoRA support, multiple samplers (Euler, UniPC, DPM++), custom sigma scheduling
  • 🌐 Distributed Support: Single-GPU, multi-GPU (torchrun), Ray distributed inference, Model Pool management
  • 🔌 ComfyUI Integration: ComfyUI node support (standalone submodule) for visual workflow design
  • 🖥️ Multi-Platform Support: GPU, NPU, XPU (WIP)

📦 Supported Models

Video Generation Models

| Model | Type | Parameters | Tasks | Status |
|-------|------|------------|-------|--------|
| Turbo Diffusion | Image-to-Video | 14B | I2V | ✅ |
| Wan2.2-T2V | Text-to-Video | 5B/14B | T2V | ✅ |
| Wan2.2-I2V | Image-to-Video | 14B | I2V | ✅ |
| Wan2.1-Vace | Video Controllable Editing | 14B | Vace | ✅ |

Image Generation Models

| Model | Type | Parameters | Tasks | Status |
|-------|------|------------|-------|--------|
| Qwen-Image | Text-to-Image | 20B | T2I | ✅ |
| Qwen-Image Edit | Image Editing | 20B | Image Edit | ✅ |

๐Ÿ› ๏ธ Installation

Docker

We are actively working on Dockerfiles. Stay tuned!

Requirements

  • Python: >= 3.10, < 4.0
  • PyTorch: >= 2.0
  • GPU Environment:
    • CUDA >= 12.8
    • Recommended: NVIDIA GPUs
  • NPU Environment:
    • CANN >= 8.0
    • torch_npu adapter

Installation Steps

# Clone the repository
git clone https://github.com/Tencent/KsanaDiT.git
cd KsanaDiT

# Run the installation script (automatically handles all dependencies)
bash scripts/install_public.sh

The installation script will automatically detect your hardware environment and install the appropriate dependencies.

🔌 Interface Support

kDiT can be used in several ways, depending on your scenario:

Local Pipeline Mode

Run locally through the Python Pipeline API, suitable for scripted batch generation or integration into your own systems:

from kdit import Pipeline

# Create inference pipeline
pipeline = Pipeline.from_models("path/to/model")

# Generate video/image
result = pipeline.generate(prompt, ...)

For detailed usage, refer to Quick Start and the examples directory.

ComfyUI Integration

kDiT can be used as a set of ComfyUI custom nodes, providing a visual workflow experience:

# 1. Clone the kDiT repository
git clone https://github.com/Tencent/KsanaDiT.git

# 2. Enter the kDiT directory and run the install script
cd KsanaDiT
./scripts/install_public.sh

During installation, the script will interactively prompt you to enter the ComfyUI installation root directory. After installation, restart ComfyUI and you will see kDiT-related nodes in the node list.

🚀 Quick Start

For detailed code examples, refer to examples.

Text-to-Video (T2V)

import torch
from kdit import Pipeline
from kdit.config import (
    DistributedConfig,
    RuntimeConfig,
    SampleConfig,
)

# Create inference pipeline
pipeline = Pipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    dist_config=DistributedConfig(num_gpus=1)
)

# Generate video
video = pipeline.generate(
    "Street photography, cool girl with headphones skateboarding, New York streets, graffiti wall background",
    sample_config=SampleConfig(steps=40),
    runtime_config=RuntimeConfig(
        seed=1234,
        size=(720, 480),
        frame_num=17,
        return_frames=True,
    ),
)

print(f"Generated video shape: {video.shape}")

Image-to-Video (I2V)

from kdit import Pipeline
from kdit.config import RuntimeConfig, SampleConfig
from kdit.pipelines.context_builders.wan import WanI2VExtraInputs

pipeline = Pipeline.from_models("path/to/Wan2.2-I2V-A14B")

video = pipeline.generate(
    "Girl gently waves her fan, blows a breath of fairy air, lightning flies from her hand into the sky and thunder begins",
    extra_inputs=WanI2VExtraInputs(start_img_path="input.png"),
    sample_config=SampleConfig(steps=40),
    runtime_config=RuntimeConfig(
        seed=1234,
        size=(512, 512),
        frame_num=17,
    ),
)

Turbo Diffusion

See the run_turbo_diffusion example.

Text-to-Image (T2I)

import torch
from kdit import Pipeline
from kdit.config import (
    ModelConfig,
    RuntimeConfig,
    SampleConfig,
    SolverType,
)

pipeline = Pipeline.from_models(
    "path/to/Qwen-Image",
    model_config=ModelConfig(run_dtype=torch.bfloat16),
)

image = pipeline.generate(
    "A cute orange cat sitting on a windowsill, sunlight streaming through the window onto its fur",
    sample_config=SampleConfig(
        steps=20,
        cfg_scale=4.0,
        solver=SolverType.FLOWMATCH_EULER,
    ),
    runtime_config=RuntimeConfig(
        seed=42,
        size=(1024, 1024),
    ),
)
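The `cfg_scale` parameter above controls classifier-free guidance strength. As a rough illustration of what it does (the generic CFG formula, not kDiT's internal code), the model is run with and without the text condition and the two predictions are combined:

```python
def cfg_combine(uncond, cond, cfg_scale):
    """Classifier-free guidance: push the conditional prediction
    away from the unconditional one by cfg_scale."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond, cond)]

# Toy per-element predictions for a 3-element latent
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]
print(cfg_combine(uncond, cond, 4.0))  # → [4.0, 1.0, -6.0]
```

At `cfg_scale=1.0` this reduces to the conditional prediction alone, which is why the 4-step LoRA example further below pairs `cfg_scale=1.0` with distilled weights.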

🎯 Advanced Features

FP8 Quantized Inference

import torch
from kdit import Pipeline
from kdit.config import (
    ModelConfig,
    KsanaAttentionConfig,
    KsanaAttentionBackend,
    KsanaLinearBackend,
)

model_config = ModelConfig(
    run_dtype=torch.float16,
    attention_config=KsanaAttentionConfig(backend=KsanaAttentionBackend.SAGE_ATTN),
    linear_backend=KsanaLinearBackend.FP8_GEMM,
)

pipeline = Pipeline.from_models(
    ("high_noise_fp8.safetensors", "low_noise_fp8.safetensors"),
    model_config=model_config,
)

LoRA Accelerated Inference

from kdit import Pipeline
from kdit.config import LoraConfig, SampleConfig

pipeline = Pipeline.from_models(
    "path/to/Wan2.2-T2V-A14B",
    lora_config=LoraConfig("path/to/Wan2.2-Lightning-4steps-lora"),
)

# Fast generation with 4 steps
video = pipeline.generate(
    prompt,
    sample_config=SampleConfig(
        steps=4,
        cfg_scale=1.0,
        sigmas=[1.0, 0.9375, 0.6333, 0.225, 0.0],
    ),
)
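Judging from the example above, a custom `sigmas` list has `steps + 1` boundary values running strictly from 1.0 down to 0.0, with each denoising step moving from `sigmas[i]` to `sigmas[i+1]`. A hedged sketch of how such a schedule could be built and sanity-checked (illustrative only, not kDiT's scheduler code):

```python
def uniform_sigmas(steps):
    """Evenly spaced schedule from 1.0 down to 0.0 with steps + 1 points."""
    return [1.0 - i / steps for i in range(steps + 1)]

def validate_sigmas(sigmas, steps):
    """Check the invariants implied by the example above."""
    assert len(sigmas) == steps + 1, "need steps + 1 boundary values"
    assert sigmas[0] == 1.0 and sigmas[-1] == 0.0
    assert all(a > b for a, b in zip(sigmas, sigmas[1:])), "must be strictly decreasing"

custom = [1.0, 0.9375, 0.6333, 0.225, 0.0]  # the 4-step schedule shown above
validate_sigmas(custom, steps=4)
print(uniform_sigmas(4))  # → [1.0, 0.75, 0.5, 0.25, 0.0]
```

The distilled LoRA's hand-tuned schedule is front-loaded (larger early steps) compared with the uniform fallback, which is typical for few-step distillation.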

Smart Cache Optimization - Under Active Development

from kdit.config.cache_config import (
    DCacheConfig,
    DBCacheConfig,
    HybridCacheConfig,
)

# Use hybrid caching strategy
cache_config = HybridCacheConfig(
    step_cache=DCacheConfig(fast_degree=50),
    block_cache=DBCacheConfig(),
)

video = pipeline.generate(
    prompt,
    cache_config=cache_config,
)

Multi-GPU Distributed Inference

# Method 1: Using CUDA_VISIBLE_DEVICES
CUDA_VISIBLE_DEVICES=0,1,2,3 python your_script.py

# Method 2: Using torchrun
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 your_script.py

Inside your script, declare the GPU count via DistributedConfig:

from kdit import Pipeline
from kdit.config import DistributedConfig

pipeline = Pipeline.from_models(
    model_path,
    dist_config=DistributedConfig(num_gpus=4),
)
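When launched with torchrun, each worker process can discover its identity from the standard environment variables the launcher exports. A minimal sketch of that convention (generic PyTorch distributed behavior, not kDiT-specific):

```python
import os

def dist_info(env=os.environ):
    """Read the rank variables that torchrun exports for each worker.
    Falls back to single-process defaults when launched plainly."""
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", rank))
    return rank, world_size, local_rank

print(dist_info({}))                                # → (0, 1, 0)
print(dist_info({"RANK": "2", "WORLD_SIZE": "4"}))  # → (2, 4, 2)
```

This is useful, for example, to restrict logging or file output to rank 0 in multi-GPU runs.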

📊 Performance Optimization Techniques

Quantization & Compute Optimization

| Technique | Description | Effect |
|-----------|-------------|--------|
| FP8 GEMM | FP8 quantized matrix multiplication | Reduced memory, improved speed |
| Torchao FP8 Dynamic | Dynamic FP8 quantization | Adaptive precision, balanced quality and performance |
| QKV Fuse | QKV projection fusion | Reduced memory access, improved throughput |
| torch.compile | Graph compilation optimization | 10-30% end-to-end speedup |
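To build intuition for the FP8 trade-off, here is a simulated quantize/dequantize round trip in plain Python (a crude sketch assuming the e4m3 format's roughly ±448 representable range; it is not the actual FP8 GEMM kernels, and integer rounding on a fixed grid only approximates FP8's mantissa loss):

```python
def fake_fp8_quantize(values, max_repr=448.0):
    """Scale values into the assumed e4m3 range, round on a fixed grid
    to mimic precision loss (a simplification), then dequantize."""
    amax = max(abs(v) for v in values) or 1.0  # avoid division by zero
    scale = max_repr / amax
    quantized = [round(v * scale) for v in values]
    return [q / scale for q in quantized]

vals = [0.1, -2.5, 3.14159]
deq = fake_fp8_quantize(vals)
err = max(abs(a - b) for a, b in zip(vals, deq))
print(deq, err)  # small reconstruction error relative to the value range
```

The per-tensor `scale` is the key idea: FP8 storage halves memory versus FP16 while keeping error small relative to the tensor's dynamic range.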

Attention Backends

| Backend | Characteristics | Use Case |
|---------|-----------------|----------|
| Flash Attention | High performance, memory efficient | General recommendation |
| Sage Attention | Optimized attention computation | Long sequences |
| Sage SLA | Top-k sparse attention | Turbo Diffusion |
| Radial Sage Attention | Radial sparse attention | Very long sequences |
| Torch SDPA | PyTorch native implementation | Compatibility priority |
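All of these backends compute the same scaled dot-product attention and differ only in kernel implementation and sparsity; a plain-Python reference for a single query vector (mathematically what unmasked SDPA computes, purely for illustration):

```python
import math

def sdpa_single_query(q, keys, values):
    """softmax(q·K / sqrt(d)) · V for one query vector."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of value vectors
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
print(sdpa_single_query(q, keys, values))
```

Sparse backends like Sage SLA and Radial Sage skip low-scoring key/value pairs before the softmax, trading a small accuracy loss for large speedups on long sequences.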

Caching Strategies

| Strategy | Description | Use Case |
|----------|-------------|----------|
| DCache | Step-level caching with degree-based polynomial | General video generation |
| TeaCache | Temporal-aware step-level caching | Video generation optimization |
| MagCache | Adaptive step-level caching | Balanced quality and speed |
| EasyCache | Lightweight step-level caching without pre-prepared parameters | Fast inference with minimal overhead |
| DBCache | Block-level caching | Image generation |
| HybridCache | Step-level + block-level hybrid caching | Maximum acceleration |
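The step-level strategies share one core idea: when the model's input barely changes between consecutive denoising steps, reuse the previous output instead of recomputing. A toy sketch of that skip heuristic (illustrative of the general idea only, not kDiT's actual cache implementations):

```python
class StepCache:
    """Reuse the last output when the input residual falls below a threshold."""
    def __init__(self, threshold=0.05):
        self.threshold = threshold
        self.prev_input = None
        self.prev_output = None
        self.skipped = 0

    def run(self, x, model):
        if self.prev_input is not None:
            residual = max(abs(a - b) for a, b in zip(x, self.prev_input))
            if residual < self.threshold:
                self.skipped += 1
                return self.prev_output  # cache hit: skip the expensive call
        self.prev_input, self.prev_output = x, model(x)
        return self.prev_output

cache = StepCache(threshold=0.05)
model = lambda x: [v * 0.9 for v in x]  # stand-in for an expensive DiT forward
for i in range(5):
    cache.run([1.0, 1.0 - 0.01 * i], model)
print(cache.skipped)  # → 4
```

The real strategies differ in how they estimate that residual (temporal trends for TeaCache, adaptive magnitudes for MagCache) and in granularity: block-level caches like DBCache apply the same test per transformer block.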

Samplers

| Sampler | Description | Use Case |
|---------|-------------|----------|
| Euler | Fast sampling | 4-8 step inference |
| UniPC | High-quality sampling | 20-40 step inference |
| DPM++ | Efficient multi-step sampling | General purpose |
| Turbo Diffusion | Ultra-fast sampling | 4-step inference |
| FlowMatch Euler | Flow matching sampling | Image generation |
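The Euler solvers in the table integrate the model's predicted velocity along the sigma schedule: each step applies `x ← x + (σ_next − σ) · v(x, σ)`. A self-contained sketch on a toy velocity field (generic flow-matching Euler integration, not kDiT's solver code):

```python
def euler_sample(x0, sigmas, velocity):
    """Integrate dx/dsigma = v(x, sigma) along the schedule with Euler steps."""
    x = x0
    for s, s_next in zip(sigmas, sigmas[1:]):
        x = x + (s_next - s) * velocity(x, s)
    return x

# Toy linear field: each step contracts x toward a "clean" value of 2.0
sigmas = [1.0, 0.75, 0.5, 0.25, 0.0]
x = euler_sample(5.0, sigmas, lambda x, s: x - 2.0)
print(x)  # → 2.94921875, moving toward 2.0 as sigma decreases
```

Higher-order solvers like UniPC and DPM++ reuse velocity predictions from previous steps to take more accurate steps, which is why they shine at 20-40 steps while plain Euler suffices for distilled few-step models.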

🔧 Configuration

Environment Variables

# Log level: debug/info/warn/error
export KSANA_LOGGER_LEVEL=info
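A sketch of how such a level variable is typically mapped onto Python's logging constants (a generic pattern; kDiT's actual logger setup may differ):

```python
import logging
import os

# Level names accepted by KSANA_LOGGER_LEVEL, per the README above
_LEVELS = {"debug": logging.DEBUG, "info": logging.INFO,
           "warn": logging.WARNING, "error": logging.ERROR}

def resolve_level(env=os.environ, default="info"):
    """Translate KSANA_LOGGER_LEVEL into a logging module constant,
    falling back to INFO for unknown values."""
    name = env.get("KSANA_LOGGER_LEVEL", default).lower()
    return _LEVELS.get(name, logging.INFO)

print(resolve_level({"KSANA_LOGGER_LEVEL": "warn"}) == logging.WARNING)  # → True
```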

Model Configuration

The framework supports model parameter configuration via YAML files located in the kdit/settings/ directory.

📚 Code Examples

Complete example code is available in the examples/ directory.

🧪 Testing

We maintain comprehensive test coverage. The full suite is currently time-consuming to run and is intended for developers; we are working to streamline it.

# Run all tests
pytest tests/

# Run specific tests
pytest tests/kdit/pipelines/wan2_2_t2v_test.py

# Run GPU tests
bash scripts/ci_tests/ci_kdit_gpus.sh

๐Ÿค Contributing

We welcome community contributions! Before submitting a PR, please ensure:

  1. Code passes all tests
  2. Follows project code style (using git commit hook)
  3. Includes necessary documentation and comments
  4. Updates relevant README and examples
# Install development dependencies
pip install -e ".[dev]"

# Run code style checks
pre-commit run --all-files

# Run tests
pytest tests/

📋 Changelog

For a detailed list of changes in each version, see the CHANGELOG.

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

This project builds on many excellent open-source projects.

📮 Contact

๐Ÿ—บ๏ธ Roadmap

Completed ✅

  • Multi-Platform Support: GPU, NPU, XPU backend support
  • Batch Inference: Support for batch size > 1, merged cond/uncond
  • Video Editing: Wan2.1 Vace video controllable editing
  • Advanced Samplers: DPM++, Turbo Diffusion support
  • Performance Optimization: QKV Fuse + Dynamic FP8 optimization
  • Memory Optimization: Pin Manager to resolve OOM issues
  • Smart Caching: MagCache, TeaCache, EasyCache strategies
  • Image Editing: Qwen Image Edit model support
  • VAE Parallelism: Multi-GPU VAE decoding
  • Monitoring: Inference metrics reporting

In Progress 🚧

  • Support for more generation models (Z-Image, Hunyuan, etc.)
  • Memory optimization for longer video generation
  • Cache strategy performance tuning
  • Model quantization toolchain
  • XPU full feature support optimization

If this project helps you, please give us a ⭐️ Star!

Made with ❤️ by the kDiT Team
