STVFormer

STVFormer is a semantic segmentation framework built on top of OpenMMLab’s MMSegmentation. This repository extends MMSegmentation with a novel architecture—Spatial-Temporal Vision Transformer (STVFormer)—designed to enhance segmentation performance in video and sequential imagery by leveraging both spatial and temporal contextual cues.

This repo inherits MMSegmentation’s modular design, training pipelines, and evaluation tools, and integrates additional components specific to STVFormer, including custom backbones, temporal fusion modules, and dataset loaders optimized for sequential data.
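As a rough illustration of the temporal-fusion idea (not the repository's actual module), the sketch below weights per-frame features by their similarity to a target frame and averages them; the function name and the similarity-based weighting scheme are illustrative assumptions, not STVFormer's implementation:

```python
import numpy as np

def temporal_fusion(frame_feats, target_idx):
    """Fuse per-frame features toward a target frame.

    frame_feats: array of shape (T, C, H, W) for T frames.
    Each frame is weighted by the cosine similarity of its
    global descriptor to the target frame's, then averighted-averaged.
    """
    T, C, H, W = frame_feats.shape
    # Global average-pooled descriptor per frame: (T, C)
    desc = frame_feats.mean(axis=(2, 3))
    target = desc[target_idx]
    # Cosine similarity of every frame to the target frame: (T,)
    sims = desc @ target / (
        np.linalg.norm(desc, axis=1) * np.linalg.norm(target) + 1e-8
    )
    # Softmax over time, then weighted average: (C, H, W)
    weights = np.exp(sims) / np.exp(sims).sum()
    return np.tensordot(weights, frame_feats, axes=(0, 0))
```

The real model uses learned attention inside a transformer rather than fixed cosine weights, but the shape bookkeeping (fusing T frames of (C, H, W) features into one) is the same.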

demo image

Installation

Step 1: Install MMSegmentation. This repository is forked from MMSegmentation v1.2.2.

Please refer to get_started.md for installation and dataset_prepare.md for dataset preparation.

Step 2: Prepare the datasets

Cityscapes

Download the Cityscapes dataset from the official website; registration with a username and password is required. For additional helpers and extended utilities, see:

  • city-scapes-script by cemsaz: A collection of scripts for automated downloading, extraction, and organization of the Cityscapes dataset. Useful for simplifying dataset preparation pipelines.

Cityscapes Images and Fine Annotations

The Cityscapes dataset contains the images and their fine annotations. The images are in the leftImg8bit folder, and the annotations are in the gtFine folder.
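For reference, MMSegmentation's dataset_prepare.md conventionally expects Cityscapes under data/cityscapes in roughly the following layout (the sequence and coarse folders follow the same city-wise structure; exact paths may differ in your setup):

```text
data/cityscapes/
├── leftImg8bit/
│   ├── train/   # <city>/<city>_<seq>_<frame>_leftImg8bit.png
│   ├── val/
│   └── test/
├── gtFine/
│   ├── train/   # *_gtFine_labelIds.png, *_gtFine_color.png, ...
│   ├── val/
│   └── test/
├── gtCoarse/
└── leftImg8bit_sequence/
```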

Cityscapes Coarse Annotations

Cityscapes also provides coarse annotations: the corresponding images are in the leftImg8bit_sequence folder, and the annotations are in the gtCoarse folder. The coarse annotations are used for training the STVFormer model.
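Both fine and coarse annotations ship as *_labelIds.png files whose ids are usually remapped to the 19 Cityscapes training classes before training (MMSegmentation's conversion script does this via cityscapesscripts). A minimal NumPy sketch of that remapping, using the mapping from the official cityscapesScripts labels table:

```python
import numpy as np

# Cityscapes labelId -> trainId mapping for the 19 evaluation classes
# (from the official cityscapesScripts labels table); every other id
# maps to the ignore index 255.
LABEL_TO_TRAIN = {
    7: 0, 8: 1, 11: 2, 12: 3, 13: 4, 17: 5, 19: 6, 20: 7, 21: 8,
    22: 9, 23: 10, 24: 11, 25: 12, 26: 13, 27: 14, 28: 15, 31: 16,
    32: 17, 33: 18,
}

def labelids_to_trainids(label_map, ignore_index=255):
    """Remap a Cityscapes *_labelIds.png array to train ids."""
    out = np.full_like(label_map, ignore_index)
    for label_id, train_id in LABEL_TO_TRAIN.items():
        out[label_map == label_id] = train_id
    return out
```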

Cityscapes DemoVideo

The demoVideo split is in the leftImg8bit/demoVideo folder. Additionally, for each Cityscapes training image there are 30 video frames from which the training image was sampled.

Cityscapes Sequences

The training image with the coarse annotations is always the 19th frame of its 30-frame sequence. The sequences are stored in the leftImg8bit_sequence folder and are used for training the STVFormer model.
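Under the Cityscapes naming scheme <city>_<seq>_<frame>_leftImg8bit.png, and taking the README's statement that the annotated image is the 19th of the 30 frames, the remaining frames of a sequence can be enumerated as follows (a hypothetical helper for illustration, not part of the repo):

```python
def sequence_frames(annotated_name, num_frames=30, annotated_pos=19):
    """Return all frame file names of the 30-frame sequence that
    contains the given annotated Cityscapes frame.

    Assumes the annotated frame sits at index `annotated_pos`
    (the 19th frame, per the README), so the sequence spans
    frames [f - 19, f + 10] around annotated frame number f.
    """
    city, seq, frame, suffix = annotated_name.split("_", 3)
    start = int(frame) - annotated_pos
    return [f"{city}_{seq}_{start + i:06d}_{suffix}" for i in range(num_frames)]
```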

BDD100k

TODO

Training with STVFormer

```shell
python tools/train.py configs/stvformer/stvformer_cityscapes.py --work-dir work_dirs/stvformer_cityscapes
```

Inference with STVFormer

For inference, we either use Weights & Biases or run the inference script directly:

```shell
python tools/STV_Inference.py
```

Contributing

STVFormer builds upon MMSegmentation. Please refer to CONTRIBUTING.md for the contributing guidelines.

About

STVFormer Spatio-Temporal Video Semantic Segmentation: Extension of OpenMMLab Semantic Segmentation Toolbox and Benchmark.
