SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation

Authors: Youhong Wang, Yunji Liang, Hao Xu, Shaohui Jiao, Hongkai Yu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on KITTI and Cityscapes show that our method attains remarkable state-of-the-art performance."
Researcher Affiliation | Collaboration | Youhong Wang (1,2), Yunji Liang (1)*, Hao Xu (2), Shaohui Jiao (2), Hongkai Yu (3). Affiliations: 1 Northwestern Polytechnical University; 2 ByteDance Inc.; 3 Cleveland State University.
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | "Code is available at https://github.com/hisfog/SfMNeXt-Impl."
Open Datasets | Yes | "KITTI (Geiger et al. 2013) is a dataset that provides stereo image sequences, which is commonly used for self-supervised monocular depth estimation. Cityscapes (Cordts et al. 2016) is a challenging dataset which contains numerous moving objects." For Make3D (Saxena, Sun, and Ng 2008): "To evaluate the generalization ability of SQLdepth, we use the KITTI-pretrained SQLdepth to perform zero-shot evaluation on the Make3D dataset, and provide additional depth map visualizations."
Dataset Splits | No | The paper mentions using the Eigen test split for KITTI and refers to standard datasets, but it does not provide the explicit training/validation/test percentages or sample counts needed to reproduce the splits, nor does it mention a specific validation split. (The conventional evaluation protocol it relies on is sketched after the table.)
Hardware Specification | Yes | "The model is trained on 3 NVIDIA V100 GPUs, with a batch size of 16."
Software Dependencies | No | The paper states "Our method is implemented using Pytorch framework (Paszke et al. 2019)" but does not provide version numbers for PyTorch or any other software dependency.
Experiment Setup | Yes | "The model is trained on 3 NVIDIA V100 GPUs, with a batch size of 16. Following the settings from (Godard et al. 2019), we use color and flip augmentations on images during training. We jointly train both DepthNet and PoseNet with the Adam optimizer (Kingma and Ba 2014) with β1 = 0.9, β2 = 0.999. The initial learning rate is set to 1e-4 and decays to 1e-5 after 15 epochs. We set the SSIM weight to α = 0.85 and the smoothness loss term weight to λ = 1e-3. We use ResNet-50 (He et al. 2016) with ImageNet (Russakovsky et al. 2015) pretrained weights as backbone, as the other baselines do." (This recipe is sketched in code directly after the table.)
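The hyperparameters quoted in the Experiment Setup row are complete enough to reconstruct the optimization recipe. Below is a minimal PyTorch sketch, assuming the loss formulation of Godard et al. (2019), which the paper says it follows; `depth_net` and `pose_net` are throwaway placeholders, not the SQLdepth architecture.

```python
import torch
import torch.nn as nn

ALPHA = 0.85          # SSIM weight in the photometric loss (from the paper)
LAMBDA_SMOOTH = 1e-3  # edge-aware smoothness weight (from the paper)

class SSIM(nn.Module):
    """Per-pixel SSIM loss map, as used in Godard et al. (2019)."""
    def __init__(self):
        super().__init__()
        self.pool = nn.AvgPool2d(3, 1)
        self.pad = nn.ReflectionPad2d(1)
        self.c1, self.c2 = 0.01 ** 2, 0.03 ** 2

    def forward(self, x, y):
        x, y = self.pad(x), self.pad(y)
        mu_x, mu_y = self.pool(x), self.pool(y)
        var_x = self.pool(x * x) - mu_x ** 2
        var_y = self.pool(y * y) - mu_y ** 2
        cov = self.pool(x * y) - mu_x * mu_y
        num = (2 * mu_x * mu_y + self.c1) * (2 * cov + self.c2)
        den = (mu_x ** 2 + mu_y ** 2 + self.c1) * (var_x + var_y + self.c2)
        return torch.clamp((1 - num / den) / 2, 0, 1)

ssim = SSIM()

def photometric_loss(pred, target):
    """alpha * SSIM term + (1 - alpha) * L1, per Godard et al. (2019)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    return ALPHA * ssim(pred, target).mean(1, keepdim=True) + (1 - ALPHA) * l1

def smoothness_loss(disp, img):
    """Edge-aware first-order smoothness on mean-normalized disparity."""
    d = disp / (disp.mean([2, 3], keepdim=True) + 1e-7)
    dx = (d[..., :, :-1] - d[..., :, 1:]).abs()
    dy = (d[..., :-1, :] - d[..., 1:, :]).abs()
    ix = (img[..., :, :-1] - img[..., :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[..., :-1, :] - img[..., 1:, :]).abs().mean(1, keepdim=True)
    return (dx * torch.exp(-ix)).mean() + (dy * torch.exp(-iy)).mean()

# Joint optimization of both networks with the quoted hyperparameters.
# Placeholder modules stand in for the paper's DepthNet / PoseNet.
depth_net = nn.Conv2d(3, 1, 3, padding=1)  # placeholder DepthNet
pose_net = nn.Conv2d(6, 6, 1)              # placeholder PoseNet
params = list(depth_net.parameters()) + list(pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
# gamma=0.1 at epoch 15 realizes the quoted 1e-4 -> 1e-5 decay.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15], gamma=0.1)
# Per batch: total = photometric_loss(...).mean() + LAMBDA_SMOOTH * smoothness_loss(...)
```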
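Since the Dataset Splits row notes that the paper leans on the standard Eigen test split without restating it, here is a sketch of the conventional evaluation protocol in this literature, also applicable to the zero-shot Make3D transfer: per-image median scaling (self-supervised monocular predictions are scale-ambiguous), followed by the usual seven error metrics. The depth caps of 1e-3 m and 80 m are the common KITTI defaults, assumed here rather than quoted from the paper.

```python
import numpy as np

def evaluate_depth(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Standard depth error metrics with per-image median scaling."""
    mask = (gt > min_depth) & (gt < max_depth)
    pred, gt = pred[mask], gt[mask]
    pred = pred * (np.median(gt) / np.median(pred))  # resolve scale ambiguity
    pred = np.clip(pred, min_depth, max_depth)

    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": float(np.mean(np.abs(gt - pred) / gt)),
        "sq_rel": float(np.mean((gt - pred) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((gt - pred) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))),
        "a1": float((thresh < 1.25).mean()),
        "a2": float((thresh < 1.25 ** 2).mean()),
        "a3": float((thresh < 1.25 ** 3).mean()),
    }

# Usage: compute per-image metrics over the test set (697 images in the
# standard Eigen split) and average the resulting dictionaries.
```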