FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

Authors: Jae-hun Shim, Hyunwoo Yu, Kyeongbo Kong, Suk-Ju Kang

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show the superiority of our decoder with various light-weight transformer-based decoders on popular semantic segmentation datasets. Our model FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes, respectively.
Researcher Affiliation | Academia | 1) Department of Electronic Engineering, Sogang University, Seoul, 04017, Republic of Korea; 2) Department of Media School, Pukyong National University, Busan, 48547, Republic of Korea. jhshim1995@sogang.ac.kr, hyunwoo137@sogang.ac.kr, kbkong@pknu.ac.kr, sjkang@sogang.ac.kr
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (e.g., sections or figures explicitly labeled 'Pseudocode' or 'Algorithm').
Open Source Code | Yes | Code will be released at: https://github.com/jhshim1995/FeedFormer.
Open Datasets | Yes | We conducted experiments on two publicly available datasets, ADE20K (Zhou et al. 2017) and Cityscapes (Cordts et al. 2016). ADE20K is a challenging scene parsing dataset covering 150 fine-grained semantic concepts. It consists of a training set of 20,210 images, a validation set of 2,000 images, and a testing set of 3,352 images. Cityscapes is an urban driving scene dataset for semantic segmentation consisting of 5,000 finely annotated images with 19 categories. These high-resolution images are divided into a training set of 2,975 images, a validation set of 500 images, and a testing set of 1,525 images.
Dataset Splits | Yes | ADE20K is a challenging scene parsing dataset covering 150 fine-grained semantic concepts. It consists of a training set of 20,210 images, a validation set of 2,000 images, and a testing set of 3,352 images. Cityscapes is an urban driving scene dataset for semantic segmentation consisting of 5,000 finely annotated images with 19 categories. These high-resolution images are divided into a training set of 2,975 images, a validation set of 500 images, and a testing set of 1,525 images.
Hardware Specification | Yes | We used 4 RTX 3090 GPUs for all training throughout the experiments. We tested inference time of a single image of 2048×1024 resolution using a single RTX 3090 GPU under the mmsegmentation benchmark without any additional accelerating techniques.
Software Dependencies | No | The paper mentions using 'the public codebase mmsegmentation' and provides a URL (https://github.com/open-mmlab/mmsegmentation), but it does not specify a version number for mmsegmentation or for any other crucial software (e.g., Python or PyTorch versions) that would enable exact reproduction of the software environment.
Experiment Setup | Yes | During training, we applied data augmentation using random resizing with ratios from 0.5 to 2.0, random horizontal flipping, and random cropping to 512×512 pixel resolution for ADE20K and 1024×1024 pixel resolution for Cityscapes. We trained the models using the AdamW optimizer for 160K iterations. We set the batch size to 16 for ADE20K and 8 for Cityscapes. We set the learning rate to an initial value of 6e-5 and then used a polynomial learning rate decay schedule with factor 1.0 by default.
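
The Hardware Specification row reports single-image latency at 2048×1024 on one RTX 3090 under the mmsegmentation benchmark. The paper's exact benchmark invocation is not given, so the following is only a minimal PyTorch sketch of GPU latency measurement with CUDA events; the 1×1 convolution is a hypothetical stand-in, not FeedFormer.

```python
# Hedged sketch of single-image GPU latency measurement with CUDA events.
# The stand-in 1x1 conv is NOT FeedFormer; substitute the real model.
import torch

model = torch.nn.Conv2d(3, 19, kernel_size=1).cuda().eval()  # placeholder model
x = torch.randn(1, 3, 1024, 2048, device="cuda")  # 2048x1024 input, as reported

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad():
    for _ in range(50):           # warm-up to stabilize clocks and caches
        model(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(200):          # timed runs
        model(x)
    end.record()
    torch.cuda.synchronize()      # wait until all timed kernels finish

print(f"mean latency: {start.elapsed_time(end) / 200:.2f} ms/image")
```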
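Since the Software Dependencies row flags missing version pins, anyone reproducing the setup has to record their own environment. A minimal sketch follows; the mmcv and mmseg module names assume an mmsegmentation installation is present.

```python
# Record the library versions actually used, since the paper pins none.
import platform

print("python:", platform.python_version())
for pkg in ("torch", "mmcv", "mmseg"):
    try:
        mod = __import__(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown')}")
    except ImportError:
        print(f"{pkg}: not installed")
```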
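The augmentations in the Experiment Setup row map naturally onto an mmsegmentation training pipeline. The sketch below is a hedged, abbreviated reconstruction in mmsegmentation 0.x config style, not the authors' released config: exact keys vary across versions, the base `img_scale` is the conventional ADE20K value and an assumption here, and normalization, padding, and formatting steps are omitted.

```python
# Abbreviated mmsegmentation-style train pipeline for ADE20K, reconstructed
# from the reported augmentations; NOT the authors' released config.
crop_size = (512, 512)  # ADE20K; the paper uses (1024, 1024) for Cityscapes
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', reduce_zero_label=True),
    # random resize with a 0.5-2.0 ratio range, as reported;
    # img_scale is the conventional ADE20K base scale (an assumption)
    dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
    dict(type='RandomCrop', crop_size=crop_size),
    dict(type='RandomFlip', prob=0.5),  # random horizontal flipping
    # normalization, padding, and formatting steps omitted for brevity
]
```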
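The optimization schedule in the same row (AdamW, 160K iterations, initial learning rate 6e-5, polynomial decay with factor 1.0) can be written down directly. Below is a minimal PyTorch sketch with a placeholder model; the training loop body is elided.

```python
# Sketch of the reported schedule: AdamW, 160K iterations, lr 6e-5,
# polynomial decay with power 1.0. The model is a placeholder.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

TOTAL_ITERS = 160_000  # reported training length
BASE_LR = 6e-5         # reported initial learning rate
POLY_POWER = 1.0       # reported polynomial decay factor

model = torch.nn.Conv2d(3, 150, kernel_size=1)  # placeholder, not FeedFormer
optimizer = AdamW(model.parameters(), lr=BASE_LR)

# lr(t) = BASE_LR * (1 - t / TOTAL_ITERS) ** POLY_POWER
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda t: (1 - t / TOTAL_ITERS) ** POLY_POWER,
)

for it in range(TOTAL_ITERS):
    # forward pass, loss computation, and backward pass would go here
    optimizer.step()
    scheduler.step()
```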