DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation

Authors: Bowen Yin, Xuying Zhang, Zhong-Yu Li, Li Liu, Ming-Ming Cheng, Qibin Hou

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets.
Researcher Affiliation Academia Bowen Yin1 Xuying Zhang1 Zhongyu Li1 Li Liu2 Ming-Ming Cheng1,3 Qibin Hou1,3 1 VCIP, School of Computer Science, Nankai University 2 National University of Defense Technology 3 Nankai International Advanced Research Institute (Shenzhen Futian)
Pseudocode No No pseudocode or algorithm blocks were found in the paper.
Open Source Code Yes Our code is available at: https://github.com/VCIP-RGBD/DFormer.
Open Datasets Yes we first apply a depth estimator, e.g., Adabin (Bhat et al., 2021), on the ImageNet-1K dataset (Russakovsky et al., 2015) to generate a large number of image-depth pairs. ... NYUDepthv2 (Silberman et al., 2012) and SUN-RGBD (Song et al., 2015). ... The finetuning dataset consists of 2,195 samples, where 1,485 are from NJU2K-train (Ju et al., 2014) and the other 700 samples are from NLPR-train (Peng et al., 2014).
Dataset Splits Yes NYUDepthv2 (Silberman et al., 2012) contains 1,449 RGB-D samples covering 40 categories, where the resolution of all RGB images and depth maps is unified as 480×640. Particularly, 795 image-depth pairs are used to train the RGB-D model, and the remaining 654 are utilized for testing. SUN-RGBD (Song et al., 2015) includes 10,335 RGB-D images with 530×730 resolution, where the objects are in 37 categories. All samples of this dataset are divided into 5,285 and 5,050 splits for training and testing, respectively.
Hardware Specification Yes All the pretraining experiments are conducted on 8 NVIDIA 3090 GPUs. All the finetuning experiments for RGB-D semantic segmentation are conducted on 2 NVIDIA 3090 GPUs.
Software Dependencies No The paper mentions optimizers like 'AdamW' and losses like 'Cross-entropy loss', but does not specify software versions for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific libraries.
Experiment Setup Yes Table 17: DFormer ImageNet-1K pretraining settings. (e.g., input size 224x224, optimizer AdamW, base learning rate 1e-3, weight decay 0.05, batch size 1024, training epochs 300, warmup epochs 5). Table 18: DFormer finetuning settings on NYUDepthv2/SUNRGBD. (e.g., input size 480x640 / 480x480, base learning rate 6e-5/8e-5, weight decay 0.01, batch size 8/16, epochs 500/300).
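The pretraining hyperparameters quoted above (base learning rate 1e-3, 300 epochs, 5 warmup epochs) can be sketched as a learning-rate schedule. This is a minimal illustration only: the paper's tables list the values but not the decay curve, so the linear-warmup-plus-cosine-decay shape below is an assumption, not the authors' confirmed implementation.

```python
import math

# Values quoted from Table 17 (ImageNet-1K pretraining).
BASE_LR = 1e-3
EPOCHS = 300
WARMUP_EPOCHS = 5

def lr_at_epoch(epoch: int) -> float:
    """Per-epoch learning rate: linear warmup, then cosine decay to 0.

    The cosine-decay shape is an assumption for illustration; the paper
    only reports the base LR, total epochs, and warmup epochs.
    """
    if epoch < WARMUP_EPOCHS:
        # Ramp linearly from BASE_LR / WARMUP_EPOCHS up to BASE_LR.
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    # Fraction of the post-warmup training completed, in [0, 1).
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))
```

For example, `lr_at_epoch(4)` returns the full base rate of 1e-3 at the end of warmup, after which the rate decays smoothly toward zero by epoch 300.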