DFormer: Rethinking RGBD Representation Learning for Semantic Segmentation
Authors: Bowen Yin, Xuying Zhang, Zhong-Yu Li, Li Liu, Ming-Ming Cheng, Qibin Hou
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We finetune the pretrained DFormer on two popular RGB-D tasks, i.e., RGB-D semantic segmentation and RGB-D salient object detection, with a lightweight decoder head. Experimental results show that our DFormer achieves new state-of-the-art performance on these two tasks with less than half of the computational cost of the current best methods on two RGB-D semantic segmentation datasets and five RGB-D salient object detection datasets. |
| Researcher Affiliation | Academia | Bowen Yin (1), Xuying Zhang (1), Zhong-Yu Li (1), Li Liu (2), Ming-Ming Cheng (1,3), Qibin Hou (1,3); (1) VCIP, School of Computer Science, Nankai University; (2) National University of Defense Technology; (3) Nankai International Advanced Research Institute (Shenzhen Futian) |
| Pseudocode | No | No pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Our code is available at: https://github.com/VCIP-RGBD/DFormer. |
| Open Datasets | Yes | we first apply a depth estimator, e.g., AdaBins (Bhat et al., 2021), on the ImageNet-1K dataset (Russakovsky et al., 2015) to generate a large number of image-depth pairs. ... NYUDepthv2 (Silberman et al., 2012) and SUN-RGBD (Song et al., 2015). ... The finetuning dataset consists of 2,195 samples, where 1,485 are from NJU2K-train (Ju et al., 2014) and the other 700 samples are from NLPR-train (Peng et al., 2014). (A minimal sketch of this depth-generation pipeline appears after the table.) |
| Dataset Splits | Yes | NYUDepthv2 (Silberman et al., 2012) contains 1,449 RGB-D samples covering 40 categories, where the resolution of all RGB images and depth maps is unified as 480×640. Particularly, 795 image-depth pairs are used to train the RGB-D model, and the remaining 654 are utilized for testing. SUN-RGBD (Song et al., 2015) includes 10,335 RGB-D images with 530×730 resolution, where the objects are in 37 categories. All samples of this dataset are divided into 5,285 and 5,050 splits for training and testing, respectively. |
| Hardware Specification | Yes | All the pretraining experiments are conducted on 8 NVIDIA 3090 GPUs. All the finetuning experiments for RGB-D semantic segmentation are conducted on 2 NVIDIA 3090 GPUs. |
| Software Dependencies | No | The paper mentions optimizers like 'AdamW' and losses like 'cross-entropy loss', but does not specify software versions for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific libraries. |
| Experiment Setup | Yes | Table 17: DFormer ImageNet-1K pretraining settings (e.g., input size 224×224, optimizer AdamW, base learning rate 1e-3, weight decay 0.05, batch size 1024, training epochs 300, warmup epochs 5). Table 18: DFormer finetuning settings on NYUDepthv2/SUN-RGBD (e.g., input size 480×640 / 480×480, base learning rate 6e-5/8e-5, weight decay 0.01, batch size 8/16, epochs 500/300). (An optimizer/schedule sketch of the pretraining settings appears after the table.) |
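The depth-generation step quoted in the Open Datasets row can be summarized in a short sketch. This is an illustration only, not the authors' released code: `estimate_depth` is a hypothetical stand-in for a pretrained monocular depth estimator such as AdaBins, and here it simply returns the luminance channel so the snippet runs without model weights.

```python
from pathlib import Path

import numpy as np
from PIL import Image

def estimate_depth(rgb: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in for AdaBins inference: returns an HxW "depth"
    # map derived from luminance so the sketch runs end to end.
    return rgb.mean(axis=2).astype(np.float32)

def build_image_depth_pairs(imagenet_dir: str, out_dir: str) -> None:
    # For every ImageNet-1K image, store a pseudo ground-truth depth map,
    # yielding the image-depth pairs used for RGB-D pretraining.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in sorted(Path(imagenet_dir).rglob("*.JPEG")):
        rgb = np.asarray(Image.open(img_path).convert("RGB"))
        depth = estimate_depth(rgb)
        np.save(out / f"{img_path.stem}_depth.npy", depth)
```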
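For concreteness, the Table 17 pretraining hyperparameters quoted in the Experiment Setup row map onto a PyTorch optimizer and schedule roughly as below. This is a minimal sketch, assuming linear warmup followed by cosine decay (the quote reports warmup epochs but not the decay curve), with `nn.Linear` standing in for the DFormer backbone.

```python
import math

import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the DFormer backbone

EPOCHS, WARMUP_EPOCHS, BASE_LR = 300, 5, 1e-3
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)

def lr_at(epoch: int) -> float:
    # Linear warmup over the first 5 epochs, then cosine decay to zero
    # (the decay curve is an assumption; only warmup is quoted above).
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))

for epoch in range(EPOCHS):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one pass over ImageNet-1K at an effective batch size of 1024 ...
```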