Depth-Relative Self Attention for Monocular Depth Estimation
Authors: Kyuhong Shim, Jiyoung Kim, Gusang Lee, Byonghyo Shim
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased to RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model for unseen depths. |
| Researcher Affiliation | Academia | Kyuhong Shim, Jiyoung Kim, Gusang Lee and Byonghyo Shim, Department of Electrical and Computer Engineering, Seoul National University, Korea {khshim, jykim, gslee, bshim}@islab.snu.ac.kr |
| Pseudocode | No | The paper describes the architecture and process of RED-T, including 'The detailed process of each cycle is as follows:', but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | NYU-v2 [Silberman et al., 2012] dataset includes pairs of RGB images and depth maps on 464 indoor scenes, which are separated into 120K training samples from 249 scenes and 654 testing samples from 215 scenes. The range of depth labels is up to 10 meters. We train our model on 50K subset following previous work [Yuan et al., 2022]. KITTI [Geiger et al., 2013] dataset consists of paired RGB images and corresponding depth maps obtained by a 3D laser scanner on 61 outdoor scenes while driving. The range of depth annotations is up to 80 meters. |
| Dataset Splits | Yes | First, following the Eigen split setting [Eigen et al., 2014], we train our model with about 26K samples from 32 scenes and test on 687 samples from 29 scenes. Second, for the online depth prediction configuration [Geiger et al., 2012], we use 72K training samples, 6K validation samples, and 500 testing samples. |
| Hardware Specification | Yes | We train our model with a batch size of 16 for 24 epochs on 8 NVIDIA A5000 24GB GPUs. |
| Software Dependencies | No | The paper mentions using 'AdamW optimizer [Kingma and Ba, 2014]' and 'Swin Transformer (Swin) [Liu et al., 2021]' but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We use AdamW optimizer [Kingma and Ba, 2014] with a learning rate of 1e-4, (β1, β2) of (0.9, 0.999), and a weight decay of 0.1. The learning rate starts at 4e-6, increases to the maximum value for 25% of the total iterations, and then decreases to 1e-6. We train our model with a batch size of 16 for 24 epochs on 8 NVIDIA A5000 24GB GPUs. The gradient is accumulated every 2 batches and clipped to the maximum gradient norm of 0.1. (A hedged sketch of this setup follows the table.) |
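
The sketch below shows how the quoted training recipe could be wired up in PyTorch. It is a minimal illustration, not the authors' code: the model is a placeholder for RED-T, `total_iters` is an assumed value (the paper specifies 24 epochs, not an iteration count), and the linear shape of the warmup/decay schedule is also an assumption, since the paper only states the start, peak, and end learning rates. The hyperparameters (lr 1e-4, betas (0.9, 0.999), weight decay 0.1, warmup over 25% of iterations, accumulation every 2 batches, gradient-norm clipping at 0.1) follow the quoted text.

```python
# Hypothetical reproduction sketch of the quoted "Experiment Setup" row.
# The model, data, and total_iters are placeholders; only the hyperparameters
# are taken from the paper's description.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(8, 1)  # placeholder for the RED-T depth network
optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.1)

total_iters = 10_000                    # assumed number of optimizer updates (paper gives 24 epochs)
warmup_iters = int(0.25 * total_iters)  # learning rate rises for 25% of total iterations

def lr_lambda(step: int) -> float:
    """Warmup 4e-6 -> 1e-4, then decay to 1e-6; returned as a multiplier of the base lr (1e-4)."""
    base, start, end = 1e-4, 4e-6, 1e-6
    if step < warmup_iters:
        return (start + (base - start) * step / warmup_iters) / base
    frac = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return (base - (base - end) * frac) / base

scheduler = LambdaLR(optimizer, lr_lambda)

accum_steps = 2  # gradient accumulated every 2 batches
dummy_loader = [(torch.randn(16, 8), torch.randn(16, 1))] * 4  # stand-in for the real dataloader
for step, (x, y) in enumerate(dummy_loader):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()
```

Note that with accumulation every 2 batches, each optimizer update aggregates two batches; whether the stated batch size of 16 is per accumulation step or per update is not spelled out in the quoted text, so the dummy loader above simply uses 16 samples per batch.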