Releases: linkedin/Liger-Kernel
v0.7.0
🚀 Liger-Kernel Now Fully Supports Transformers v5
We’ve added full support for Transformers v5!
🔗 #994
Liger now supports all 🤗 Transformers versions ≥ 4.52.0, including the latest v5 release.
Broader compatibility. Seamless upgrades. No version headaches.
Thanks to all the contributors!
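For projects that pin dependencies, the support window above can be expressed as a simple guard. This is an illustrative helper, not part of Liger's API:

```python
def liger_supports(transformers_version: str) -> bool:
    # Support window stated above: Hugging Face Transformers >= 4.52.0,
    # including the v5.x releases.
    major, minor = (int(p) for p in transformers_version.split(".")[:2])
    return (major, minor) >= (4, 52)
```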
What's Changed
- Add CISPO loss type support for LigerFusedLinearGRPOLoss by @yukiu00 in #1054
- Update checkstyle and fix the format issue by @xuedinge233 in #1071
- Add SAPO loss type support for LigerFusedLinearGRPOLoss by @yukiu00 in #1073
- [NPU]: Adaptive modification of NPU by @TianHao324 in #1055
- [NPU] Frequencies fusion for Llama4_rope on NPU by @lowdy1 in #1053
- Add CISPO and SAPO loss type support for Triton GRPO loss kernel by @yukiu00 in #1074
- Add vLLM importance sampling ratio support for GRPO loss by @yukiu00 in #1088
- Relaxing logp relative tolerances for mini-llama4 to fix flaky test. by @kolehma8 in #1089
- Moving unit testing to a merge queue. by @kolehma8 in #1069
- Fixing unit tests. by @kolehma8 in #1092
- Merge queue test. by @kolehma8 in #1095
- Changes to the nvidia tests in merge queue. by @kolehma8 in #1097
- Support transformers v5 by @Tcc0403 in #994
- Remove latest v4 test to reduce cost by @Mecoli1219 in #1098
- [Model] Pixtral Support by @AndreSlavescu in #253
- [NPU]: NPU-optimized rms_norm kernel by @TianHao324 in #1099
- [NPU]: NPU-optimized fused_add_rms_norm kernel by @TianHao324 in #1070
- [NPU]: add support for grpo loss by @TianHao324 in #1049
- Update pyproject.toml for v0.7.0 release by @vaibhavjindal in #1102
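Several entries above add new loss types (CISPO, SAPO) and importance-sampling support to the GRPO loss. As a rough, framework-free sketch of the clipped policy-gradient family these variants belong to (the function and clipping form here are illustrative, not Liger's fused kernels):

```python
import math

def clipped_pg_loss(logp_new, logp_old, advantage, eps=0.2):
    # Importance ratio between the current and behavior policies.
    ratio = math.exp(logp_new - logp_old)
    # PPO/GRPO-style clipped surrogate: take the more pessimistic of the
    # unclipped and clipped terms, negated so the result can be minimized.
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    return -min(unclipped, clipped)
```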
New Contributors
- @xuedinge233 made their first contribution in #1071
Full Changelog: v0.6.5...v0.7.0
v0.6.5
What's Changed
- fix bug: ensure FP32 accumulation for dW in Llama-mode RMSNorm backward by @niyunsheng in #950
- Add Ascend NPU device support. by @Ginray in #955
- define shift_labels in gemma by @akoumpa in #961
- [feat]: Add support for gpt-oss by @yeshsurya in #949
- Update README.md by @PKUWZP in #970
- Fix qwen3vl `apply_rotary_pos_emb_vision` by @Tcc0403 in #967
- [refactor] decoupling ops implementations for different vendors by @pillumina in #973
- Fix: fix ignore_index not being applied in JSD distillation loss by @roycho96 in #974
- ci: skip some rms_norm test cases for npu and bump torch-npu to 2.7.1 by @ji-huazhong in #977
- Fix missing property access for multimodal models by @albertvillanova in #966
- Bug fix for missing distillation loss arguments. by @kolehma8 in #983
- Add `AutoLigerKernelForCausalLM.from_config` by @Tcc0403 in #962
- Fix geglu by @konstantinos-p in #986
- Update discord channel link and announcement for meetup by @momochen in #984
- [NPU]: Adjust MAX_FUSED_SIZE when using fused_linear_cross_entropy by @zheliuyu in #985
- [RMSNorm] Fix JIT recompilation by removing tl.constexpr on rows_per_program & Cleanup Block kernel interface by @niyunsheng in #988
- [Feature] Add elementwise_affine argument to LigerRMSNorm by @niyunsheng in #989
- [Fix] Handle missing 'elementwise_affine' in RMSNorm extra_repr for patched models by @niyunsheng in #990
- [Fix] Replace conditional flow with `tl.where` in `liger_cross_entropy_kernel` for Triton 3.2 compatibility by @niyunsheng in #991
- feat(NPU): add UB Manager for auto tiling strategy management by @noemotiovon in #987
- [NPU]: Add NPU support for the mrope operator by @TianHao324 in #992
- Changing RMS layer norm to accept DTensors. by @kolehma8 in #982
- [NPU]: Add NPU support for the swiglu by @jiaqiw09 in #995
- fix: prevent command injection in benchmark workflow by @arde171 in #997
- [Model] Exaone4 Support by @roycho96 in #980
- Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in #981
- [NPU]: avoid pointer mutation in rms_norm kernel by @TianHao324 in #1000
- [Fix]: avoid pointer mutation in group norm kernel by @noemotiovon in #999
- [NPU]: support kl div on NPU by @noemotiovon in #1001
- fix: replace HybridCache with Cache in gemma2 and gemma3 by @qgallouedec in #1002
- [NPU]: refine geglu memory_multiplier based on UB analysis by @noemotiovon in #996
- [NPU]: Add NPU support for the tvd operator by @TianHao324 in #998
- [Test]: Add test suite for Llama4 RoPE implementation by @noemotiovon in #1004
- feat: Align TiledMLP with DeepSpeed/ALST/Axolotl for PEFT compatibility by @akshatvishu in #1005
- [NPU]: adjust MAX_FUSED_SIZE for NPU devices in group_norm by @noemotiovon in #1003
- avoid pointer mutation in add_rms_norm kernel by @TianHao324 in #1008
- pass param down to LigerFusedLinearCrossEntropyLoss by @kaixuanliu in #1010
- avoid pointer mutation in layer_norm kernel by @TianHao324 in #1006
- XPU: Enable new grf_mode settings by @Egor-Krivov in #1016
- gemma3 consider loss_kwargs by @jp1924 in #1007
- [Refactor]: optimize poly_norm backward kernel pointer handling by @noemotiovon in #1018
- Enable `benchmark_tvd.py` for xpu devices by @Egor-Krivov in #1024
- Add pre-commit config by @Tcc0403 in #1009
- [NPU]: update the native KLDivLoss implementation for comparison (e.g. test_jsd.py) by @kiritorl in #1032
- doc: reformat contributing.md for better visualization on github by @Tcc0403 in #1033
- [NPU]: Add NPU support for the embedding by @TianHao324 in #1028
- [NPU]: use get_soc_spec for UB capacity detection by @noemotiovon in #1038
- [NPU]: optimize GEGLU implementation with flatten 1D approach by @noemotiovon in #1031
- [NPU]: optimize tvd implementation by @TianHao324 in #1039
- Workaround for OOM error on `benchmark_jsd` by @Egor-Krivov in #1037
- Fix `benchmark_qwn2vl_mrope.py` by enabling new transformers API for Qwen2VLRotaryEmbedding by @Egor-Krivov in #1026
- [NPU] FIX CE ub overflow on NPU by @lowdy1 in #1040
- Avoid OOM error for `benchmark_tvd.py` on GPUs with less than 66GB of memory by @Egor-Krivov in #1042
- [NPU]: optimize rope and mrope implementation by @TianHao324 in #1041
- [NPU] Add Llama4_rope support on NPU by @lowdy1 in #1035
- Fix memory requirements for `benchmark_jsd.py` and `benchmark_distill_jsd_loss.py` by @Egor-Krivov in #1050
- Set transformers upper bound by @Tcc0403 in #1046
- Unify NPU vector core count helpers by @lowdy1 in #1052
- New version release v0.6.5 by @vaibhavjindal in #1063
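Many of the entries above touch RMSNorm kernels: FP32 accumulation in the backward pass, NPU-optimized forward and fused-add variants. The forward computation they all implement reduces to the following reference, shown as a plain-Python sketch rather than the Triton/NPU kernels themselves:

```python
def rms_norm(x, weight, eps=1e-6):
    # Accumulate the mean of squares in full precision, then scale:
    # y_i = x_i / sqrt(mean(x^2) + eps) * w_i
    ms = sum(float(v) * float(v) for v in x) / len(x)
    inv_rms = (ms + eps) ** -0.5
    return [v * inv_rms * w for v, w in zip(x, weight)]
```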
New Contributors
- @Ginray made their first contribution in #955
- @akoumpa made their first contribution in #961
- @pillumina made their first contribution in #973
- @roycho96 made their first contribution in #974
- @ji-huazhong made their first contribution in #977
- @albertvillanova made their first contribution in #966
- @kolehma8 made their first contribution in #983
- @konstantinos-p made their first contribution in #986
- @zheliuyu made their first contribution in #985
- @noemotiovon made their first contribution in #987
- @TianHao324 made their first contribution in #992
- @jiaqiw09 made their first contribution in #995
- @arde171 made their first contribution in #997
- @salmanmkc made their first contribution in #981
- @qgallouedec made their first contribution in #1002
- @akshatvishu made their first contribution in #1005
- @kaixuanliu made their first contribution in #1010
- @kiritorl made their first contribution in #1032
- @lowdy1 made their first contribution in #1040
Full Changelog: v0.6.4...v0.6.5
v0.6.4 release
Highlights
- New model architectures: Qwen3-VL, hunyuanv1, Olmo3
- New algorithm: DAPO loss
- Optimizations: LayerNorm backward, Tiled MLP
What's Changed
- Option to return hard and soft loss when using distillation by @h-aurelien-lac in #895
- Fix CE patch and add layernorm support for InternVL by @MilkClouds in #921
- fix(ci): modify Glm4vMoe config for convergence test by @Tcc0403 in #918
- Support for Qwen3-VL models by @mayankagarwals in #911
- style: fix main branch format by @Tcc0403 in #929
- fix: initialize grad_weight and grad_bias on flce no_grad path by @keatonelvins in #931
- Fix qwen3 related tests by @vaibhavjindal in #933
- [Cross-entropy-loss] return mean token accuracy metric with CE loss by @kashif in #910
- Handle aux_loss for different transformer versions by @vaibhavjindal in #934
- Add TiledMLP Implementation by @upskyy in #935
- [Qwen3]: If qwen3 is used along with peft config, peft adds opcl obj no… by @yeshsurya in #926
- Increase time limit for modal tests by @vaibhavjindal in #947
- add hunyuanv1 dense and moe model by @Kingsleyandher in #940
- Olmo3 model support by @tyler-romero in #946
- [GRPO] add support for dapo loss by @kashif in #939
- [Perf] Optimize LayerNorm Backward: Replace Atomics with Persistent Reduction by @niyunsheng in #945
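The TiledMLP implementation above (#935) reduces peak activation memory by running the MLP over row tiles of the input rather than the full batch at once. A toy, framework-free sketch of the tiling idea (names illustrative, not Liger's API, and with no autograd):

```python
def tiled_apply(rows, fn, tile_size=2):
    # Apply fn over fixed-size row tiles so that only one tile's
    # intermediate activations need to be live at a time.
    out = []
    for start in range(0, len(rows), tile_size):
        out.extend(fn(r) for r in rows[start:start + tile_size])
    return out
```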
New Contributors
- @h-aurelien-lac made their first contribution in #895
- @mayankagarwals made their first contribution in #911
- @keatonelvins made their first contribution in #931
- @upskyy made their first contribution in #935
- @yeshsurya made their first contribution in #926
- @Kingsleyandher made their first contribution in #940
- @niyunsheng made their first contribution in #945
Full Changelog: v0.6.3...v0.6.4
v0.6.3 release
Highlights in this release:
- New model architecture support: SmolVLM2, GLM4.5V, InternVL3, Falcon-H1, Qwen-Next
- New algorithm: GSPO
What's Changed
- [cross-entropy-loss] Added support for DFT flag by @kashif in #860
- fix(test): update assertions in GLM4 instance patching tests by @vvvdwbvvv in #859
- Fix nan loss error for LigerFusedLinearJSDLoss by @ParagEkbote in #862
- [Cross-entropy] get valid predicted probabilities by @kashif in #864
- Enhance Docs by @ParagEkbote in #867
- Add Classifiers for Liger-Kernel by @ParagEkbote in #869
- docs(mta): suppress invalid sequence syntax warning by @Tcc0403 in #870
- Add GSPO by @BjarniHaukur in #845
- Add GLM4.5V support by @vvvdwbvvv in #863
- A Fix for Issue #872 by @yshenaw in #879
- Add pytest coverage for liger-kernel by @ParagEkbote in #876
- Replace all torch_dtype with dtype by @Tcc0403 in #881
- Update Dev Dependencies by @ParagEkbote in #886
- Fixed AMD CI issue #793 by @DevManpreet5 in #887
- fix(layernorm): remove `n_cols` upcasting for torch.compile by @Tcc0403 in #884
- Fix tests and CI by @vaibhavjindal in #882
- Remove daily test cron job by @vaibhavjindal in #890
- [UT] [XPU] Modify the test cases of XPU for triton3.5 by @YangKai0616 in #889
- Add InternVL3 support by @MilkClouds in #878
- fix(flce): add `shift_labels` as eval mode loss condition by @Tcc0403 in #888
- Add support of Falcon-H1 models for liger kernels by @puneeshkhanna in #900
- Don't deploy mkdocs to fix benchmarking by @vaibhavjindal in #904
- Disable mllama multimodal test in transformers<4.51.0 by @Tcc0403 in #899
- Add flce forward for FalconH1ForCausalLM and missing tests by @Tcc0403 in #903
- feat(ce,flce): decouple gradients computation for no_grad mode by @Tcc0403 in #894
- fix(llama4): Get correct swiglu patch target for llama4 moe layer by @alenawang in #907
- Add PolyNorm operator by @0xtoward in #901
- Copy and paste benchmarks before and after gh-pages deployment by @vaibhavjindal in #909
- Filter out redundant ops/allocations in no_grad mode by @Tcc0403 in #906
- Add support for Qwen3Next model with Liger kernels by @vvvdwbvvv in #912
- refactor(convergence-test): remove unnecessary print by @Tcc0403 in #913
- Enabled the tests glm4v/glm4v_moe for XPU and Fixed the monkey patch error by @YangKai0616 in #914
- [Test][XPU] Added gpu cache cleaning for XPU devices by @Egor-Krivov in #917
- Add SmolVLM2 support by @MilkClouds in #919
New Contributors
- @BjarniHaukur made their first contribution in #845
- @yshenaw made their first contribution in #879
- @DevManpreet5 made their first contribution in #887
- @MilkClouds made their first contribution in #878
- @puneeshkhanna made their first contribution in #900
- @alenawang made their first contribution in #907
- @0xtoward made their first contribution in #901
Full Changelog: v0.6.2...v0.6.3
v0.6.2
What's Changed
- Automate Benchmarking - fixing issue. by @Manan17 in #836
- Make path variable global by @Manan17 in #840
- Adding support for apo losses, sppo_hard and nca_pair by @Manan17 in #841
- Add `accum_dtype` option for `FusedLinearCrossEntropy` by @Tcc0403 in #830
- CI tests fix by @Manan17 in #847
- docs(README): fix intel ci link by @Tcc0403 in #842
- Llama4 rope implementation by @Manan17 in #843
- fix(phi3): update monkey patch for `Phi3ForCausalLM` by @Tcc0403 in #837
- feat(FLCE): expose `accum_dtype` for hf model monkey patch by @Tcc0403 in #851
- Fix ci by @Manan17 in #853
- Fix missing low-level api imports by @Kirill-Kravtsov in #856
- Add glm4.1v model support by @vvvdwbvvv in #858
- Update pyproject.toml version to 0.6.2 by @vaibhavjindal in #861
New Contributors
- @Kirill-Kravtsov made their first contribution in #856
Full Changelog: v0.6.1...v0.6.2
v0.6.1
What's Changed
- Fix gemma3 forward with skip_logits by @BitPhinix in #795
- Update README.md by @PKUWZP in #808
- Fix minor typo by @hugoabonizio in #809
- Update README.md by @PKUWZP in #811
- Fix embedding benchmarks for backward pass by @Manan17 in #799
- Giving an option to update benchmark results for previous commits. by @Manan17 in #791
- [Model] Liger support for SmolLM3 by @edbeeching in #798
- FusedAddRMSNorm: Fused residual addition and RMS Norm by @vaibhavjindal in #812
- Skip smollm3 tests in tests-bwd by @vaibhavjindal in #821
- Layernorm enhancement by @Manan17 in #815
- Update README.md by @PKUWZP in #823
- Update index.md by @PKUWZP in #824
- Remove smollm3 import at top of file by @vaibhavjindal in #825
- Fix illegal memory access in Triton RMSNorm kernel by casting program_id to int64 by @vvvdwbvvv in #804
- fix(benchmark): move chunked loss module init out of measurements by @Tcc0403 in #643
- [XPU]Fixed the issue with multiple num_warps parameters being passed in. by @YangKai0616 in #831
- Automate benchmarking - for every release by @Manan17 in #828
- Revert "Bug Fix: name patching for modules" by @vaibhavjindal in #833
- Bug fixes in patching module by @vaibhavjindal in #834
- docs(README): fix gpumode discord badge by @Tcc0403 in #835
- Update pyproject.toml version to 0.6.1 by @shimizust in #838
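FusedAddRMSNorm (#812) fuses the residual addition with the subsequent RMS normalization so the two steps share one pass over the data. In plain Python, the fused computation is equivalent to the following reference sketch (not the Triton kernel itself):

```python
def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    # Residual add and RMSNorm in one logical pass; returns both the
    # normalized output and the updated residual stream.
    h = [a + b for a, b in zip(x, residual)]
    ms = sum(v * v for v in h) / len(h)
    inv_rms = (ms + eps) ** -0.5
    return [v * inv_rms * w for v, w in zip(h, weight)], h
```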
New Contributors
- @BitPhinix made their first contribution in #795
- @PKUWZP made their first contribution in #808
- @hugoabonizio made their first contribution in #809
- @edbeeching made their first contribution in #798
Full Changelog: v0.6.0...v0.6.1
v0.6.0: New Attention Operators, Cosine Similarity Loss, Llama 4, and VLM Patching Updates
Highlights
This release introduces significant improvements to Liger-Kernel, including new operators, support for Llama 4 models, more robust benchmarking automation, and key fixes to vision-language model (VLM) patching needed after recent transformers refactoring.
Key Changes
New Features & Improvements
- Multi-Token Attention by @AndreSlavescu (#689)
- Fused Neighborhood Attention by @AndreSlavescu (#732)
- Cosine Similarity Loss for Distillation by @Dexterai (#780)
- Support for Llama 4 by @Manan17 (#740)
- Option to choose fused LCE/CE loss by @connermanuel (#704)
- Add block_rms_norm for QK norm by @mdy666 (#731)
Bug Fixes
- Vision-language model patching in recent transformers versions (>=4.52.0):
- RMS Norm patching by @vaibhavjindal, @BenasdTW (#741, #765)
- Hugging Face forward kwargs fix by @llllvvuu (#708)
- Fix import tanh by @jue-jue-zi (#762)
- Apply monkey patch to instances by @YangKai0616 (#772)
Documentation & CI Fixes
- Deploy MkDocs to GitHub Pages by @ParagEkbote (#724)
- Robust doc updates by @ParagEkbote (#726, #727)
- .idea ignored by @Tcc0403 (#784)
- ReadMe, MTA + softmax docs by @AndreSlavescu (#730)
- Relax DyT tol, XPU skip MTA by @Tcc0403 (#778)
- Paligemma test fixes by @vvvdwbvvv (#785)
- Style & test fixes by @Tcc0403, @vaibhavjindal (#736, #794)
- Add torchvision for multimodal test by @Tcc0403 (#755)
Benchmarking & Automation
- Automated benchmarking and visualization UI in GitHub pages by @Manan17 (#744, #747, #749, #752, #753, #756, #759, #760, #770, #779)
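Among the new features, the Cosine Similarity Loss for distillation (#780) penalizes angular disagreement between student and teacher representations. Stripped of batching and fusion, the per-vector loss is simply the following (a reference sketch, not Liger's fused implementation):

```python
import math

def cosine_distill_loss(student, teacher):
    # 1 - cos(student, teacher): zero when the vectors are aligned,
    # growing as they point in different directions.
    dot = sum(s * t for s, t in zip(student, teacher))
    norm_s = math.sqrt(sum(s * s for s in student))
    norm_t = math.sqrt(sum(t * t for t in teacher))
    return 1.0 - dot / (norm_s * norm_t)
```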
New Contributors
- @connermanuel made their first contribution in #704
- @llllvvuu made their first contribution in #708
- @jue-jue-zi made their first contribution in #762
- @YangKai0616 made their first contribution in #772
- @Dexterai made their first contribution in #780
- @vvvdwbvvv made their first contribution in #785
Full Changelog: v0.5.10...v0.6.0
v0.5.10: Qwen3 MOE support, Sparsemax kernel, bug fixes
What's Changed
- fix zip bug by @KareemMusleh in #702
- [dpo] set default average_log_prob to False by @cyr0930 in #693
- Rank build status lower by @momochen in #707
- Add support for Qwen3 MoE models by @chiwanpark in #706
- Fix qwen3_moe flaky convergence test by @vaibhavjindal in #710
- Fix empty Medusa head tensors by @chiwanpark in #698
- Sparsemax by @AndreSlavescu in #687
- fix: remove docstring imports in transformer patches by @NanoCode012 in #712
- Increase tests timeout to 45 mins by @vaibhavjindal in #718
- fix modal tests by @shivam15s in #719
- Visualizer Update by @AndreSlavescu in #717
- Sparsemax Documentation by @AndreSlavescu in #716
- element-wise DyT, faster than the original LigerDyT by @mdy666 in #673
- GRPO Loss kernel written fully in Triton, reducing memory by 46G by @mdy666 in #672
- Make FLCE compatible with FSDP and PEFT by @astefanutti in #674
- Fix incorrect module patching when using LoRA with modules_to_save by @BenasdTW in #632
- [XPU] Changed how XPU discovery works during `setup.py` by @Egor-Krivov in #720
- Fix to publish docs on pushes to main branch by @shimizust in #722
- Release 0.5.10 by @shimizust in #725
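The Sparsemax kernel (#687) implements the sparse alternative to softmax from Martins & Astudillo (2016): a Euclidean projection onto the probability simplex that can assign exactly zero to low-scoring entries. A compact plain-Python reference (not the Triton kernel):

```python
def sparsemax(z):
    # Find the threshold tau such that max(z_i - tau, 0) sums to 1.
    zs = sorted(z, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, v in enumerate(zs, start=1):
        cumsum += v
        if 1.0 + k * v > cumsum:  # v is still in the support
            tau = (cumsum - 1.0) / k
    return [max(v - tau, 0.0) for v in z]
```

Unlike softmax, well-separated logits collapse to a one-hot output, while ties split mass evenly.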
New Contributors
- @KareemMusleh made their first contribution in #702
- @cyr0930 made their first contribution in #693
- @NanoCode012 made their first contribution in #712
- @mdy666 made their first contribution in #673
- @astefanutti made their first contribution in #674
- @Egor-Krivov made their first contribution in #720
Full Changelog: v0.5.9...v0.5.10
v0.5.9: Adds XPU Setup, GLM-4 & Qwen3 Model Support, Key Bugfixes
What's Changed
- update setup.py for installation on xpu by @faaany in #668
- update XPU CI yaml file to use docker container by @faaany in #669
- Add average_log_prob as an init param for LigerFusedLinearDPOLoss by @vaibhavjindal in #676
- add shift label change by @shivam15s in #683
- remove tests that can pass on XPU by @faaany in #686
- Update mkdocs.yml by @shivam15s in #691
- Fix LigerCrossEntropy reduction='none' by @Tcc0403 in #680
- Support GLM-4 models by @intervitens in #685
- Import glm4_lce_forward locally in function by @vaibhavjindal in #695
- Qwen3 model support by @vaibhavjindal in #692
- Use logits_to_keep logic for training runs by @vaibhavjindal in #696
- increase gemma3 multimodal convergence test loss atol by @shivam15s in #697
- Update pyproject.toml by @shivam15s in #700
New Contributors
- @intervitens made their first contribution in #685
Full Changelog: v0.5.8...v0.5.9
v0.5.8: Backward-Compatible Fix
What's Changed
- backward compatible initialization by @shivam15s in #666
- Update pyproject.toml by @shivam15s in #667
Full Changelog: v0.5.7...v0.5.8