Questions about Text-to-Image Pretraining

Hello, the training section of the paper says that full 3D attention is first trained on a 256×256 image dataset. How can 3D attention be trained on image data, given that an image has one fewer dimension than a video (it has no temporal axis)?
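To make the question concrete, here is a minimal sketch of my current guess: an image is treated as a single-frame video (T = 1), so attention over all T·H·W positions reduces to attention over the H·W spatial positions. This is my assumption, not something I found stated in the paper. The function names `to_video` and `full_3d_attention` are hypothetical, the attention has a single head with no learned projections, and the 4×4 grid with 8 channels is a toy stand-in for the real 256×256 data.

```python
import numpy as np


def to_video(x):
    """Return x as (T, H, W, C). An image (H, W, C) becomes a 1-frame video."""
    x = np.asarray(x, dtype=np.float64)
    if x.ndim == 3:
        x = x[None, ...]  # assumed convention: add a temporal axis of length 1
    if x.ndim != 4:
        raise ValueError("expected (H, W, C) or (T, H, W, C)")
    return x


def full_3d_attention(x):
    """Toy single-head self-attention over all T*H*W positions jointly."""
    video = to_video(x)
    t, h, w, c = video.shape
    tokens = video.reshape(t * h * w, c)            # one token per (t, h, w)
    scores = tokens @ tokens.T / np.sqrt(c)         # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return (weights @ tokens).reshape(t, h, w, c)


rng = np.random.default_rng(0)
image = rng.standard_normal((4, 4, 8))       # (H, W, C)
video = rng.standard_normal((3, 4, 4, 8))    # (T, H, W, C)

print(full_3d_attention(image).shape)  # (1, 4, 4, 8)
print(full_3d_attention(video).shape)  # (3, 4, 4, 8)
```

If this is right, the same attention weights can be used for both images and videos, since only the token count changes (16 tokens for the image above, 48 for the 3-frame video). Is this what the paper does, or are images handled differently?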