Hi.
First of all, thank you for making such a model available to us.
I am trying to get vector embeddings of the abstracts of some PubMed articles, but somehow I can't get sentence embeddings. More precisely, I wrote the code below, and the vectors I obtain have dimension 2560. However, the Hugging Face page says the sequence length is 1024, so I understood that an embedding vector should have dimension 1024. Am I wrong?
Can you help with getting sentence embeddings?
Best wishes.
Orhan
import json

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BioMedLM")
model = AutoModel.from_pretrained("BioMedLM")
tokenizer.pad_token = tokenizer.eos_token

# Load the articles and pull out abstracts and titles.
f = open('articles.json', "r")
data = json.loads(f.read())
data_abst = [data[i]['abstract'] for i in range(len(data))]
data_title = [data[i]['title'] for i in range(len(data))]

def normalizer(x):
    # Scale a vector to unit length.
    normalized_vector = x / np.linalg.norm(x)
    return normalized_vector

class BioMedLM:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def sentence_vectors(self, sentence):
        inputs = self.tokenizer(sentence, padding=True, truncation=True, return_tensors="pt")
        w_vectors = self.model(**inputs)
        token_embeddings = w_vectors[0]  # first element of the model output contains all token embeddings
        # Mean pooling: average the token embeddings, ignoring padding positions.
        input_mask_expanded = inputs.attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        vec = torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return vec[0]

gpt_class = BioMedLM(model, tokenizer)

def sentence_encoder(data):
    vectors = []
    normalized_vectors = []
    for i in range(len(data)):
        sentence_vectors = gpt_class.sentence_vectors(data[i]).detach().numpy()
        vectors.append(sentence_vectors)
        normalized_vectors.append(normalizer(sentence_vectors))
    vectors = np.squeeze(np.array(vectors))
    normalized_vectors = np.squeeze(np.array(normalized_vectors))
    return vectors, normalized_vectors

abst_vectors, abst_vectors_norm = sentence_encoder(data_abst)
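
Regarding the dimension question above, here is a minimal sanity check, assuming BioMedLM follows the usual GPT-2 config conventions in transformers (where hidden_size and max_position_embeddings are aliases of n_embd and n_positions). It prints the two numbers separately, which would suggest that 2560 is the width of each embedding vector, while the 1024 on the model page is the maximum number of input tokens rather than the embedding dimension:

# Quick check of what the two numbers refer to, assuming the GPT-2 style
# config exposed by transformers (hidden_size / max_position_embeddings
# are aliases of n_embd / n_positions in GPT2Config).
config = model.config
print("embedding width (hidden size):", config.hidden_size)               # the 2560 observed above
print("maximum input length in tokens:", config.max_position_embeddings)  # the 1024 on the model page

If those values print as 2560 and 1024, the mean-pooled sentence vectors produced above should be 2560-dimensional regardless of how many tokens an abstract contains.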