Description
I find our handling of the initial positional embeddings applied before the APT blocks (self.wpe, or its absence, in the definition of APTModel) a bit weird.
They are initialized here:
protein-lm-scaling/protein_lm/modeling/models/apt/model_pytorch.py
Lines 453 to 460 in 86ca8f5
if self.position_embedding=="learned" or self.position_embedding == 'rope' or self.position_embedding == 'rerope' or self.position_embedding=="linear_rope_scaling" or self.position_embedding =="dynamic_rope_scaling":
    self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
    self.alibi = None
elif self.position_embedding=="alibi":
    maxpos = config.n_positions
    attn_heads = config.n_head
    alibi = create_alibi_tensor(attn_heads,maxpos)
    self.register_buffer('alibi',alibi)
and used here:
protein-lm-scaling/protein_lm/modeling/models/apt/model_pytorch.py
Lines 567 to 571 in 86ca8f5
if self.position_embedding=="learned" or self.position_embedding == 'rope' or self.position_embedding == 'rerope' or self.position_embedding=="linear_rope_scaling" or self.position_embedding =="dynamic_rope_scaling":
    position_embeds = self.wpe(position_ids)
    hidden_states = inputs_embeds + position_embeds
else:
    hidden_states = inputs_embeds
It seems that for the learned embedding as well as for the rope variants, a learned positional embedding is added before the hidden states are passed on to the blocks; only for alibi is this initial embedding omitted. (The APT blocks still apply rope/alibi as specified, so omitting this initial positional embedding does not mean those position encodings are never used.)
This seems weird to me because I don't see why rope should be grouped with the learned embedding. It would make more sense to me for the rope variants to also omit the initial positional embedding (i.e., no self.wpe). I would also be okay with all of them having an initial positional embedding, but that doesn't seem to be the standard way language models are implemented, e.g. in llama.
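For concreteness, here is a rough, untested sketch of the change I have in mind. The attribute names and condition strings follow the current APTModel code quoted above; the ROPE_VARIANTS tuple is just an illustrative helper, not something that exists in the repo:

```python
# Sketch only: give the initial learned positional embedding to the "learned"
# setting alone, and let the rope variants (like alibi) rely entirely on the
# per-block position handling.
ROPE_VARIANTS = ("rope", "rerope", "linear_rope_scaling", "dynamic_rope_scaling")  # illustrative name

# in APTModel.__init__
if self.position_embedding == "learned":
    self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
    self.alibi = None
elif self.position_embedding in ROPE_VARIANTS:
    # no initial positional embedding; rope is applied inside the APT blocks
    self.wpe = None
    self.alibi = None
elif self.position_embedding == "alibi":
    maxpos = config.n_positions
    attn_heads = config.n_head
    alibi = create_alibi_tensor(attn_heads, maxpos)
    self.register_buffer('alibi', alibi)

# in APTModel.forward
if self.position_embedding == "learned":
    position_embeds = self.wpe(position_ids)
    hidden_states = inputs_embeds + position_embeds
else:
    hidden_states = inputs_embeds
```

This keeps the alibi path as it is today and only changes which settings get self.wpe added up front.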
Tagging @talkhanz, who I think was the original author of this logic, and @jamaliki @jeffreyruffolo @NZ99 @pascalnotin for their thoughts.