Draft
Commits (89)
339a89c
init
vasqu Jul 22, 2025
eb9d6b4
lets tmp disable cache init
vasqu Jul 22, 2025
4260a62
some initial remote code version, for local inference use remote proc…
vasqu Jul 22, 2025
26f06a2
first cleanups
vasqu Jul 22, 2025
b3d999a
need to do this slowly
vasqu Jul 23, 2025
1e190e2
more attention cleanup
vasqu Jul 23, 2025
b44101d
llama like text attention
vasqu Jul 23, 2025
b38e048
generates different text but cos and sin tensors are always close - 1e-8
vasqu Jul 23, 2025
fcf3903
another round of rope fixups
vasqu Jul 23, 2025
62206ee
yea, gonna check tomorrow cant cheat w freqs for whatever reason
vasqu Jul 23, 2025
7e7d8e4
NOTE: last time where comp with old rope
vasqu Jul 24, 2025
fca8fba
rope cleanup
vasqu Jul 24, 2025
db80573
more rope
vasqu Jul 24, 2025
e82297b
somewhat clean 3d rope with attn - sin / cos has very small diffs to …
vasqu Jul 24, 2025
8540938
new rope type
vasqu Jul 24, 2025
dfe6714
style
vasqu Jul 24, 2025
1153291
attempt at moe, gonna need a deeper look
vasqu Jul 24, 2025
39c77ef
cleanup gate
vasqu Jul 24, 2025
aadf423
more cleaning
vasqu Jul 24, 2025
096529d
NOTE remove attempt at moe for now
vasqu Jul 24, 2025
3820cc6
another round of cleanups
vasqu Jul 24, 2025
b25a458
whoops
vasqu Jul 24, 2025
04a7882
we back boys, reattempting moe start
vasqu Aug 13, 2025
b16737f
moe should be done with this
vasqu Aug 13, 2025
30acfda
cleanup
vasqu Aug 13, 2025
5b6efdd
more cleanup
vasqu Aug 13, 2025
46efff9
nits
vasqu Aug 13, 2025
7303a31
add conversion and adjust code accordingly
vasqu Aug 18, 2025
cba549f
fix
vasqu Aug 18, 2025
add956e
Merge branch 'main' into ernie_vl
vasqu Aug 18, 2025
01187e2
make moe copyable as far as we can
vasqu Aug 18, 2025
d5f7568
cleanup conversion a bit, next config
vasqu Aug 18, 2025
41e6cfc
cleanup config part1
vasqu Aug 19, 2025
5610549
small removal of unused things
vasqu Aug 19, 2025
414fb20
config conversion, rope type doesnt get loaded tho...
vasqu Aug 19, 2025
fe3e6d7
fix rope
vasqu Aug 19, 2025
20c2c22
last hardcoded values
vasqu Aug 19, 2025
ccea132
remove unnecessary class
vasqu Aug 19, 2025
d178a02
starting to make copies available for vision, vision rope refactor to…
vasqu Aug 19, 2025
e797a0a
vl rope changes
vasqu Aug 20, 2025
8ff1dea
simplify variable resolution resampler
vasqu Aug 20, 2025
f247b64
nit
vasqu Aug 20, 2025
5e2eca3
conversion update
vasqu Aug 22, 2025
73e7c79
more conversions, standardization, and big dtype fix!
vasqu Aug 22, 2025
1d2deac
remove some docs (tmp), focus on code for me
vasqu Aug 22, 2025
cfe0b4d
oops
vasqu Aug 22, 2025
b643da6
nit
vasqu Aug 22, 2025
6869aa9
fixup embeddings, add todos
vasqu Aug 22, 2025
b7363b9
more cleanup
vasqu Aug 25, 2025
c53b080
more cleanup, next caching changes
vasqu Aug 25, 2025
60e1073
revert fp16, internally discussed weights are supposed to be bf16
vasqu Aug 26, 2025
de04496
fix rope (a bit), prepare cache logic changes
vasqu Aug 26, 2025
ba0e2cd
more prep for cache
vasqu Aug 26, 2025
e38c511
cache class is used, fixup some flags
vasqu Aug 26, 2025
46cdb54
modular refactor
vasqu Aug 27, 2025
b004f0c
partially docstrings, docs, etc
vasqu Aug 27, 2025
777fe1f
cleaner order
vasqu Aug 27, 2025
8cd3bbe
nit
vasqu Aug 27, 2025
2446afa
fix config
vasqu Aug 27, 2025
43c9dfd
remove old artefacts/todos
vasqu Aug 27, 2025
41a919a
Merge branch 'main' into ernie_vl
vasqu Aug 27, 2025
3423440
sync with remote and add some todos for orientation
vasqu Sep 1, 2025
659ae74
remove img process dep on modeling code
vasqu Sep 1, 2025
9d1233e
image processor with a few diffs highlighted to copy from maybe
vasqu Sep 2, 2025
e4d0078
fast img processor version
vasqu Sep 3, 2025
76d9a6a
modular image processors
vasqu Sep 3, 2025
79dbeeb
convert tokenizer to have dedicated video placeholder token
vasqu Sep 3, 2025
4a77472
before i forget
vasqu Sep 3, 2025
910e86c
Merge branch 'main' into ernie_vl
vasqu Sep 3, 2025
3744960
a modular bug :/
vasqu Sep 4, 2025
a552294
more processor things, some modular adjustments
vasqu Sep 8, 2025
1316c18
remove dependency on token type ids
vasqu Sep 8, 2025
2e23c08
position ids ala qwen vl and modular is bugging
vasqu Sep 10, 2025
5233495
fixup some inheritances + nits
vasqu Sep 11, 2025
0dc7d15
token type ids
vasqu Sep 11, 2025
476bfeb
moe loss, docs, simplify pos ids
vasqu Sep 11, 2025
2284067
align some feature getters
vasqu Sep 11, 2025
d24eb0f
docs
vasqu Sep 11, 2025
28181f2
rename conv -> merge aka our naming convention
vasqu Sep 11, 2025
bf00c6e
style
vasqu Sep 11, 2025
a697dde
fixup tokenizer class in auto
vasqu Sep 11, 2025
0843b21
no more nn sequential
vasqu Sep 11, 2025
ff9746d
fix chat template, fix tokenizer conversion, modular bug
vasqu Sep 12, 2025
54d3f97
remove this
vasqu Sep 12, 2025
1f35405
remove old deps (from the remote processor)
vasqu Sep 12, 2025
0db2d94
Merge branch 'main' into ernie_vl
vasqu Sep 12, 2025
4031a4b
whoops
vasqu Sep 12, 2025
542e003
argh
vasqu Sep 12, 2025
3bf78d2
todo, restarting progress tomorrow
vasqu Sep 17, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -1019,6 +1019,8 @@
title: Donut
- local: model_doc/emu3
title: Emu3
- local: model_doc/ernie4_5_vl
title: ernie4_5_vl
- local: model_doc/evolla
title: Evolla
- local: model_doc/flava
64 changes: 64 additions & 0 deletions docs/source/en/model_doc/ernie4_5_vl.md
@@ -0,0 +1,64 @@
<!--Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

<div style="float: right;">
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> </div>
</div>

# ernie4_5_vl

## Overview

The ernie4_5_vl model was proposed in [<INSERT PAPER NAME HERE>](<INSERT PAPER LINK HERE>) by <INSERT AUTHORS HERE>.
<INSERT SHORT SUMMARY HERE>

The abstract from the paper is the following:

*<INSERT PAPER ABSTRACT HERE>*

Tips:

<INSERT TIPS ABOUT MODEL HERE>

This model was contributed by [INSERT YOUR HF USERNAME HERE](https://huggingface.co/<INSERT YOUR HF USERNAME HERE>).
The original code can be found [here](<INSERT LINK TO GITHUB REPO HERE>).
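
A minimal, illustrative usage sketch for image-text-to-text generation (the checkpoint name is a placeholder and the exact preprocessing API may still change):

```python
import torch
from transformers import AutoProcessor, Ernie4_5_VLForConditionalGeneration

model_id = "<INSERT CHECKPOINT NAME HERE>"  # placeholder checkpoint id
processor = AutoProcessor.from_pretrained(model_id)
model = Ernie4_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0])
```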


## Ernie4_5_VLConfig

[[autodoc]] Ernie4_5_VLConfig

## Ernie4_5_VLTextConfig

[[autodoc]] Ernie4_5_VLTextConfig

## Ernie4_5_VLTextModel

[[autodoc]] Ernie4_5_VLTextModel
- forward

## Ernie4_5_VLModel

[[autodoc]] Ernie4_5_VLModel
- forward

## Ernie4_5_VLForConditionalGeneration

[[autodoc]] Ernie4_5_VLForConditionalGeneration
- forward
58 changes: 58 additions & 0 deletions src/transformers/modeling_rope_utils.py
@@ -376,6 +376,39 @@ def _compute_llama3_parameters(
return inv_freq_llama, attention_factor


def _compute_ernie_3d_parameters(
config: PretrainedConfig, device: "torch.device", seq_len: Optional[int] = None
) -> tuple["torch.Tensor", float]:
"""
Computes the inverse frequencies for the Ernie 4.5 VL models.

Args:
config ([`~transformers.PretrainedConfig`]):
The model configuration.
device (`torch.device`):
The device to use for initialization of the inverse frequencies.
seq_len (`int`, *optional*):
The current sequence length. Unused for this type of RoPE.
Returns:
Tuple of (`torch.Tensor`, `float`), containing the inverse frequencies for the RoPE embeddings and the
post-processing scaling factor applied to the computed cos/sin.
"""
# Gets the default RoPE parameters
inv_freq, attention_factor = _compute_default_rope_parameters(config, device, seq_len)

# Split the frequencies between the temporal and the spatial (height/width) axes
# based on `freq_allocation`, and apply the necessary (pre-)rotations
t_dim = config.rope_scaling["freq_allocation"] # time dimension
hw_dim = inv_freq.shape[-1] - t_dim # height and width dimension

inv_freq_3d = torch.empty_like(inv_freq)
# (Pre-)Rotate to avoid another rotation during the forward
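# Illustrative (hypothetical) numbers: with 8 inverse frequencies [f0, ..., f7] and freq_allocation=2,
# the first 6 are reordered to [f0, f2, f4, f1, f3, f5] (even indices first, then odd) for the
# height/width axes, while the last 2 ([f6, f7]) are kept as-is for the temporal axis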
inv_freq_3d[:hw_dim] = torch.cat([inv_freq[:-t_dim][0::2], inv_freq[:-t_dim][1::2]])
inv_freq_3d[-t_dim:] = inv_freq[-t_dim:]

return inv_freq_3d, attention_factor


# This maps the "rope_type" string field in rope config to the corresponding function to compute the RoPE parameters
# from the model config. You can append new {'rope_type': callable} pairs to this dictionary to enable custom RoPE
# parameterizations, as long as the callable has the same signature.
@@ -386,6 +419,7 @@ def _compute_llama3_parameters(
"yarn": _compute_yarn_parameters,
"longrope": _compute_longrope_parameters,
"llama3": _compute_llama3_parameters,
"ernie_3d": _comput_ernie_3d_parameters,
}


@@ -604,6 +638,29 @@ def _validate_llama3_parameters(config: PretrainedConfig, ignore_keys: Optional[
)


def _validate_ernie_3d_parameters(config: PretrainedConfig, ignore_keys: Optional[set] = None):
rope_scaling = config.rope_scaling
rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", None)) # BC: "rope_type" was originally "type"
required_keys = {"rope_type", "freq_allocation"}
received_keys = set(rope_scaling.keys())
_check_received_keys(rope_type, received_keys, required_keys, ignore_keys=ignore_keys)

partial_rotary_factor = config.partial_rotary_factor if hasattr(config, "partial_rotary_factor") else 1.0
head_dim = getattr(config, "head_dim", config.hidden_size // config.num_attention_heads)
dim = int(head_dim * partial_rotary_factor)
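# Illustrative (hypothetical) numbers: with dim=64 and freq_allocation=16, the splits computed
# below are 16, 24, 24, and 16 + 2 * 24 == 64, so no warning is emitted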

freq_allocation = rope_scaling["freq_allocation"]
if not isinstance(freq_allocation, int):
logger.warning(f"`rope_scaling`'s freq_allocation field must be an int, but got {freq_allocation}")
return

t_dim = freq_allocation
h_dim = (dim - t_dim) // 2
reconstructed_dim = t_dim + 2 * h_dim
if reconstructed_dim != dim:
logger.warning(
"`rope_scaling`'s freq_allocation field must split the rotary dim evenly into three parts: "
f"`freq_allocation` and twice `(dim - freq_allocation) // 2`. The resulting splits {t_dim}, {h_dim}, {h_dim} "
f"sum to {reconstructed_dim}, which does not match the total dim of {dim}."
)


# Like `ROPE_INIT_FUNCTIONS`, this validation function mapping can be dynamically updated for custom RoPE types.
ROPE_VALIDATION_FUNCTIONS = {
"default": _validate_default_rope_parameters,
@@ -612,6 +669,7 @@ def _validate_llama3_parameters(config: PretrainedConfig, ignore_keys: Optional[
"yarn": _validate_yarn_parameters,
"longrope": _validate_longrope_parameters,
"llama3": _validate_llama3_parameters,
"ernie_3d": _validate_ernie_3d_parameters,
}
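# Example (illustrative) usage: a model config with
#   rope_scaling = {"rope_type": "ernie_3d", "freq_allocation": 20}  # freq_allocation value is hypothetical
# is routed to `_compute_ernie_3d_parameters` for initialization and `_validate_ernie_3d_parameters`
# for validation via the two mappings above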


2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -137,6 +137,7 @@
("ernie", "ErnieConfig"),
("ernie4_5", "Ernie4_5Config"),
("ernie4_5_moe", "Ernie4_5_MoeConfig"),
("ernie4_5_vl", "Ernie4_5_VLConfig"),
("ernie_m", "ErnieMConfig"),
("esm", "EsmConfig"),
("evolla", "EvollaConfig"),
@@ -560,6 +561,7 @@
("ernie", "ERNIE"),
("ernie4_5", "Ernie4_5"),
("ernie4_5_moe", "Ernie4_5_MoE"),
("ernie4_5_vl", "Ernie4_5_VL"),
("ernie_m", "ErnieM"),
("esm", "ESM"),
("evolla", "Evolla"),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -139,6 +139,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("ernie", "ErnieModel"),
("ernie4_5", "Ernie4_5Model"),
("ernie4_5_moe", "Ernie4_5_MoeModel"),
("ernie4_5_vl", "Ernie4_5_VLModel"),
("ernie_m", "ErnieMModel"),
("esm", "EsmModel"),
("evolla", "EvollaModel"),
@@ -956,6 +957,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("blip", "BlipForConditionalGeneration"),
("blip-2", "Blip2ForConditionalGeneration"),
("chameleon", "ChameleonForConditionalGeneration"),
("ernie4_5_vl", "Ernie4_5_VLForConditionalGeneration"),
("git", "GitForCausalLM"),
("idefics2", "Idefics2ForConditionalGeneration"),
("idefics3", "Idefics3ForConditionalGeneration"),
@@ -997,6 +999,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin):
("deepseek_vl", "DeepseekVLForConditionalGeneration"),
("deepseek_vl_hybrid", "DeepseekVLHybridForConditionalGeneration"),
("emu3", "Emu3ForConditionalGeneration"),
("ernie4_5_vl", "Ernie4_5_VLForConditionalGeneration"),
("evolla", "EvollaForProteinText2Text"),
("florence2", "Florence2ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -67,6 +67,7 @@
("deepseek_vl_hybrid", "DeepseekVLHybridProcessor"),
("dia", "DiaProcessor"),
("emu3", "Emu3Processor"),
("ernie4_5_vl", "Ernie4_5_VLProcessor"),
("evolla", "EvollaProcessor"),
("flava", "FlavaProcessor"),
("florence2", "Florence2Processor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -229,6 +229,7 @@
("ernie", ("BertTokenizer", "BertTokenizerFast" if is_tokenizers_available() else None)),
("ernie4_5", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("ernie4_5_moe", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("ernie4_5_vl", (None, "LlamaTokenizerFast" if is_tokenizers_available() else None)),
("ernie_m", ("ErnieMTokenizer" if is_sentencepiece_available() else None, None)),
("esm", ("EsmTokenizer", None)),
(
24 changes: 10 additions & 14 deletions src/transformers/models/ernie4_5_moe/modeling_ernie4_5_moe.py
@@ -268,12 +268,9 @@ class Ernie4_5_MoeStatics(nn.Module):
- Additionally, usage per expert in the original codebase
"""

def __init__(self, config):
def __init__(self, num_experts_groups, num_experts):
super().__init__()

num_experts_groups = 1
num_experts = config.moe_num_experts

self.e_score_correction_bias = nn.Parameter(
torch.zeros(num_experts_groups, num_experts, dtype=torch.float32),
requires_grad=False,
@@ -303,25 +300,22 @@ class Ernie4_5_MoeSparseMoeBlock(nn.Module):
(optional) shared experts and a corrections bias during gating.
"""

def __init__(self, config):
def __init__(self, config, num_experts, intermediate_size):
super().__init__()
self.num_experts = config.moe_num_experts
self.num_experts = num_experts
self.top_k = config.moe_k

# correction bias (yes it seems to be a typo with statics <> statistics)
self.moe_statics = Ernie4_5_MoeStatics(config)
self.moe_statics = Ernie4_5_MoeStatics(num_experts_groups=1, num_experts=self.num_experts)

# gating
self.gate = nn.Linear(config.hidden_size, config.moe_num_experts, bias=False, dtype=torch.float32)
self.experts = nn.ModuleList(
[Ernie4_5_MoeMLP(config, config.moe_intermediate_size) for _ in range(config.moe_num_experts)]
)
self.gate = nn.Linear(config.hidden_size, self.num_experts, bias=False, dtype=torch.float32)
self.experts = nn.ModuleList([Ernie4_5_MoeMLP(config, intermediate_size) for _ in range(self.num_experts)])
self.norm_min = config.moe_norm_min

# (optional) shared experts for all forwards
self.shared_experts = None
if config.moe_num_shared_experts > 0:
self.shared_experts = Ernie4_5_MoeMLP(config, config.moe_intermediate_size * config.moe_num_shared_experts)
self.shared_experts = Ernie4_5_MoeMLP(config, intermediate_size * config.moe_num_shared_experts)

def forward(
self,
@@ -395,7 +389,9 @@ def __init__(self, config, layer_idx):
and layer_idx >= config.moe_layer_start_index
and layer_idx <= config.moe_layer_end_index
):
self.mlp = Ernie4_5_MoeSparseMoeBlock(config)
self.mlp = Ernie4_5_MoeSparseMoeBlock(
config, num_experts=config.moe_num_experts, intermediate_size=config.moe_intermediate_size
)
else:
self.mlp = Ernie4_5_MoeMLP(config)

24 changes: 10 additions & 14 deletions src/transformers/models/ernie4_5_moe/modular_ernie4_5_moe.py
@@ -76,12 +76,9 @@ class Ernie4_5_MoeStatics(nn.Module):
- Additionally, usage per expert in the original codebase
"""

def __init__(self, config):
def __init__(self, num_experts_groups, num_experts):
super().__init__()

num_experts_groups = 1
num_experts = config.moe_num_experts

self.e_score_correction_bias = nn.Parameter(
torch.zeros(num_experts_groups, num_experts, dtype=torch.float32),
requires_grad=False,
@@ -111,25 +108,22 @@ class Ernie4_5_MoeSparseMoeBlock(nn.Module):
(optional) shared experts and a corrections bias during gating.
"""

def __init__(self, config):
def __init__(self, config, num_experts, intermediate_size):
super().__init__()
self.num_experts = config.moe_num_experts
self.num_experts = num_experts
self.top_k = config.moe_k

# correction bias (yes it seems to be a typo with statics <> statistics)
self.moe_statics = Ernie4_5_MoeStatics(config)
self.moe_statics = Ernie4_5_MoeStatics(num_experts_groups=1, num_experts=self.num_experts)

# gating
self.gate = nn.Linear(config.hidden_size, config.moe_num_experts, bias=False, dtype=torch.float32)
self.experts = nn.ModuleList(
[Ernie4_5_MoeMLP(config, config.moe_intermediate_size) for _ in range(config.moe_num_experts)]
)
self.gate = nn.Linear(config.hidden_size, self.num_experts, bias=False, dtype=torch.float32)
self.experts = nn.ModuleList([Ernie4_5_MoeMLP(config, intermediate_size) for _ in range(self.num_experts)])
self.norm_min = config.moe_norm_min

# (optional) shared experts for all forwards
self.shared_experts = None
if config.moe_num_shared_experts > 0:
self.shared_experts = Ernie4_5_MoeMLP(config, config.moe_intermediate_size * config.moe_num_shared_experts)
self.shared_experts = Ernie4_5_MoeMLP(config, intermediate_size * config.moe_num_shared_experts)

def forward(
self,
@@ -203,7 +197,9 @@ def __init__(self, config, layer_idx):
and layer_idx >= config.moe_layer_start_index
and layer_idx <= config.moe_layer_end_index
):
self.mlp = Ernie4_5_MoeSparseMoeBlock(config)
self.mlp = Ernie4_5_MoeSparseMoeBlock(
config, num_experts=config.moe_num_experts, intermediate_size=config.moe_intermediate_size
)
else:
self.mlp = Ernie4_5_MoeMLP(config)

Binary file not shown.
28 changes: 28 additions & 0 deletions src/transformers/models/ernie4_5_vl/__init__.py
@@ -0,0 +1,28 @@
# Copyright 2025 The Qwen Team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_ernie4_5_vl import *
from .modeling_ernie4_5_vl import *
from .processing_ernie4_5_vl import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
Loading