SegViT: Semantic Segmentation with Plain Vision Transformers

Authors: Bowen Zhang, Zhi Tian, Quan Tang, Xiangxiang Chu, Xiaolin Wei, Chunhua Shen, Yifan Liu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone on the ADE20K dataset and achieves new state-of-the-art performance on COCO-Stuff-10K and PASCAL-Context datasets. Furthermore, to reduce the computational cost of the ViT backbone, we propose query-based down-sampling (QD) and query-based up-sampling (QU) to build a Shrunk structure. With the proposed Shrunk structure, the model can save up to 40% computations while maintaining competitive performance. (See the ATM sketch after this table.)
Researcher Affiliation | Collaboration | Bowen Zhang (1), Zhi Tian (2), Quan Tang (4), Xiangxiang Chu (2), Xiaolin Wei (2), Chunhua Shen (3), Yifan Liu (1). (1) The University of Adelaide, Australia; (2) Meituan Inc.; (3) Zhejiang University, China; (4) South China University of Technology, China
Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 2) and mathematical formulations, but it does not present any pseudocode or algorithm blocks.
Open Source Code | Yes | We included the code for the main experiments in the supplemental materials. All the code will be released upon acceptance.
Open Datasets | Yes | ADE20K [26] is a challenging scene parsing dataset which contains 20,210 images as the training set and 2,000 images as the validation set with 150 semantic classes. COCO-Stuff-10K [27] is a scene parsing benchmark with 9,000 training images and 1,000 test images. PASCAL-Context [29] is a dataset with 4,996 images in the training set and 5,104 images in the validation set.
Dataset Splits | Yes | ADE20K [26] is a challenging scene parsing dataset which contains 20,210 images as the training set and 2,000 images as the validation set with 150 semantic classes. PASCAL-Context [29] is a dataset with 4,996 images in the training set and 5,104 images in the validation set.
Hardware Specification | No | The paper mentions "GPU memory consumed by the global attention mechanism" and refers to "type of GPUs" in the author checklist, but it does not specify any particular GPU models (e.g., NVIDIA A100, RTX series), CPU types, or other hardware components used for the experiments.
Software Dependencies | No | The paper mentions using "MMSegmentation [28]" and the fvcore library (https://github.com/facebookresearch/fvcore) but does not provide specific version numbers for these or other key software dependencies like PyTorch, TensorFlow, Python, or CUDA.
Experiment Setup | Yes | During training, we applied data augmentation sequentially via random horizontal flipping, random resizing with a ratio between 0.5 and 2.0, and random cropping (512×512 for all datasets, except 480×480 for PASCAL-Context and 640×640 for ViT-Large on ADE20K). The batch size is 16 for all datasets, with total iterations of 160k, 80k, and 80k for ADE20K, COCO-Stuff-10K, and PASCAL-Context, respectively. (See the pipeline sketch after this table.)
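
The ATM module referenced in the Research Type row is the paper's core decoder idea. Below is a minimal PyTorch sketch of that mechanism under our own assumptions (single attention head; names such as ATMSketch are hypothetical and not from the authors' released code): the query-key similarity map that attention would normally softmax over the tokens is additionally passed through a sigmoid and read off as per-class mask predictions.

```python
import torch
import torch.nn as nn

class ATMSketch(nn.Module):
    """Minimal single-head sketch of the Attention-to-Mask (ATM) idea.

    Hypothetical names/shapes, not the authors' implementation: the
    query-key similarity map is reused, through a sigmoid, as per-class
    mask predictions.
    """

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # One learnable query (class token) per semantic class.
        self.class_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) patch tokens from a plain ViT layer, N = H * W.
        B, N, D = feats.shape
        q = self.q_proj(self.class_queries).expand(B, -1, -1)  # (B, C, D)
        k = self.k_proj(feats)                                  # (B, N, D)
        sim = q @ k.transpose(1, 2) / D ** 0.5                  # (B, C, N)
        # Sigmoid (not softmax) turns the similarity map into soft masks.
        return sim.sigmoid()                                    # (B, C, N)
```

For example, with dim=768, num_classes=150, and a 32×32 token grid (N=1024), `ATMSketch(768, 150)(torch.randn(2, 1024, 768))` returns a (2, 150, 1024) tensor that the caller reshapes to (2, 150, 32, 32) mask maps.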
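
Since the authors build on MMSegmentation (see the Software Dependencies row), the augmentation recipe quoted in the Experiment Setup row maps naturally onto a training-pipeline config. The following is a sketch in MMSegmentation 0.x style, not the authors' actual config: only the resize ratio range, crop sizes, batch size, and iteration counts come from the paper, while img_scale, cat_max_ratio, the flip probability, the GPU count, and the normalization statistics are assumed common ADE20K defaults.

```python
# Sketch of the described training pipeline in MMSegmentation 0.x config style.
crop_size = (512, 512)  # 480x480 for PASCAL-Context, 640x640 for ViT-Large on ADE20K
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),  # random resize, ratio 0.5-2.0
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),    # random cropping
    dict(type='RandomFlip', prob=0.5),                                   # random horizontal flipping
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
data = dict(samples_per_gpu=2, workers_per_gpu=4)  # e.g., 8 GPUs x 2 = batch size 16 (GPU count assumed)
runner = dict(type='IterBasedRunner', max_iters=160000)  # 80k for COCO-Stuff-10K and PASCAL-Context
```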