Replies: 2 comments
-
It is answered in #86.
-
Apologies for the confusion! As mentioned in #281 and #86, only the first 33 tokens are meaningful.
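In practice that means you can slice the logits down to the vocabulary size before any downstream use. A minimal sketch (the helper name is made up, and it assumes the trailing columns really are just padding, per #86 and #281):

import torch

def trim_sequence_logits(sequence_logits: torch.Tensor, vocab_size: int = 33) -> torch.Tensor:
    # The ESM3 sequence output head is padded out to 64 columns; only the
    # first `vocab_size` (33 for the sequence tokenizer) map to real tokens.
    return sequence_logits[..., :vocab_size]

# With `out` from the snippet in the question:
# logits = trim_sequence_logits(out.sequence_logits)  # shape (1, 30, 33)
# probs = torch.softmax(logits, dim=-1)                # per-position token probabilities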
-
Hi,
I am trying to use the ESM3 sequence logits for a downstream analysis.
When I run a forward pass on a single sequence, the shape of out.sequence_logits (see code below) is 1 x input length x 64. The tokenizer implies the last dimension should be 33, but looking at the code, the sequence output head is indeed set to dimension 64 (https://github.com/evolutionaryscale/esm/blob/0774600af03d724e8244d577c415e10617f018fe/esm/models/esm3.py#L160C9-L160C57).
Is it the case that only the first 33 logits are meaningful (as the tokenizer would seem to imply)?
Am I missing something or is there a better way to get the sequence logits?
Thanks!
Code:
import torch
from huggingface_hub import login
from esm.models.esm3 import ESM3
from esm.sdk.api import ESM3InferenceClient, ESMProtein, GenerationConfig
from esm.tokenization.sequence_tokenizer import EsmSequenceTokenizer

login()
model: ESM3InferenceClient = ESM3.from_pretrained("esm3_sm_open_v1").to("cpu")  # or "cuda"

# Tokenize a single sequence and run a forward pass
tokenizer = EsmSequenceTokenizer()
prompt = "DQATSLRILNNGHAFNVEFDDSQDKAOO"
enc_prompt = tokenizer.encode(prompt)
input_ids = torch.tensor(enc_prompt, dtype=torch.int64).unsqueeze(0)
out = model(sequence_tokens=input_ids)
out.sequence_logits.shape  # result: torch.Size([1, 30, 64])
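For reference, a quick way to confirm where the 33 comes from, assuming EsmSequenceTokenizer follows the standard Hugging Face tokenizer interface where len(tokenizer) reports the full vocabulary size:

print(len(tokenizer))  # expected: 33, i.e. the number of vocabulary entries behind the logit columns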