
What if the visual transformer does not have a class token? #25

Description

@Tgaaly

I see that the VITAttentionGradRollout code requires a class token. What if the model architecture does not have one?

If, for example, my attention layer is 196x196 (corresponding to a 14x14 spatial resolution), can one take the mean over all query patches with respect to each patch, as follows: mask = result[0].mean(0)? I tried this but didn't get very meaningful results. Is there another way to handle transformers without class tokens?
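For reference, here is a minimal sketch of plain (non-gradient) attention rollout adapted to a model without a class token. The function name rollout_without_cls, the discard_ratio parameter, and the final mean-over-queries readout are illustrative assumptions, not code from this repository; the head averaging, residual identity term, and renormalization follow the usual rollout recipe.

```python
import torch

def rollout_without_cls(attentions, discard_ratio=0.9):
    """Attention rollout for a ViT without a class token.

    attentions: list of per-layer attention tensors, each of shape
    (batch, heads, num_patches, num_patches), e.g. 196x196 for a
    14x14 patch grid. Returns a (grid, grid) per-patch importance mask.
    """
    result = torch.eye(attentions[0].size(-1))
    with torch.no_grad():
        for attention in attentions:
            # Fuse heads by averaging; take the first batch element.
            fused = attention.mean(dim=1)[0]
            # Optionally zero out the weakest attentions to reduce noise
            # (hypothetical knob, analogous to a discard ratio).
            flat = fused.view(-1)
            _, idx = flat.topk(int(flat.numel() * discard_ratio), largest=False)
            flat[idx] = 0
            # Add the identity to account for the residual connection,
            # then renormalize each row to sum to 1.
            eye = torch.eye(fused.size(-1))
            a = (fused + eye) / 2
            a = a / a.sum(dim=-1, keepdim=True)
            result = a @ result
    # With no CLS row to read off, average each patch's received
    # attention over all query patches to get a per-patch score.
    mask = result.mean(dim=0)
    grid = int(mask.numel() ** 0.5)
    return mask.reshape(grid, grid)
```

If the plain mean over queries stays uninformative, one variant worth trying is reading off the strongest incoming attention per patch instead, e.g. result.max(dim=0).values, though whether that helps will depend on the model.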
