When I use run_frame_captioning_and_visual_tokenization.sh to extract visual tokenization and frame captioning for my own dataset, I hit the following error in run_video_CapFilt.py:
File "/extract_frame_concepts/models/med.py", line 178, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (12) must match the size of tensor b (36) at non-singleton dimension 0
Did I do something wrong?
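For what it's worth, here is a minimal sketch of the class of failure this looks like: the attention matmul fails when query_layer and key_layer disagree on the batch dimension, e.g. 12 text states against 36 = 12 x 3 visual states. This is an illustration only (using NumPy in place of torch, with made-up head/sequence sizes), not the actual BLIP/med.py code:

```python
import numpy as np

# Hypothetical shapes: (batch, num_heads, seq_len, head_dim).
# query has batch 12, key has batch 36 -- mimicking the error message.
batch, heads, seq, head_dim = 12, 12, 20, 64
query_layer = np.zeros((batch, heads, seq, head_dim))    # batch dim 12
key_layer = np.zeros((batch * 3, heads, seq, head_dim))  # batch dim 36

try:
    # Same operation as med.py line 178, written with NumPy for illustration:
    # batch dims 12 vs 36 cannot broadcast, so matmul raises.
    np.matmul(query_layer, key_layer.transpose(0, 1, 3, 2))
except ValueError as e:
    print("shape mismatch:", e)
```

So it seems the visual embeddings fed as cross-attention keys carry 3x the batch size of the text queries; I am not sure whether that comes from my dataset preparation or from a config value I should have changed.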