Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

Authors: Vitor Guizilini, Rui Hou, Jie Li, Rares Ambrus, Adrien Gaidon

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTAL RESULTS: We use the standard KITTI benchmark (Geiger et al., 2013) for self-supervised training and evaluation.
Researcher Affiliation | Collaboration | ¹Toyota Research Institute (TRI) ²University of Michigan {first.last}@tri.global rayhou@umich.edu
Pseudocode | No | The paper describes its methods in prose and diagrams; no structured pseudocode or algorithm blocks are present.
Open Source Code | Yes | Source code and pretrained models are available on https://github.com/TRI-ML/packnet-sfm
Open Datasets | Yes | We use the standard KITTI benchmark (Geiger et al., 2013) for self-supervised training and evaluation. ... Following common practice, we pretrain our depth and pose networks on the Cityscapes dataset (Cordts et al., 2016), consisting of 88250 unlabeled images.
Dataset Splits | Yes | This results in 39810 images for training, 4424 for validation, and 697 for evaluation.
Hardware Specification | No | The paper mentions training with a 'batch size of 4 per GPU' but does not specify GPU or CPU models or any other hardware used for the experiments.
Software Dependencies | No | The paper states 'We implement our models with PyTorch (Paszke et al., 2017)' but does not provide version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | The initial training stage is conducted on the Cityscapes dataset for 50 epochs, with a batch size of 4 per GPU and initial depth and pose learning rates of 2×10⁻⁴ and 5×10⁻⁴ respectively, which are halved every 20 epochs. Afterwards, the depth and pose networks are fine-tuned on KITTI for 30 epochs with the same parameters, halving the learning rates every 12 epochs. ... we use a ResNet-50 backbone with ImageNet (Deng et al., 2009) pretrained weights and optimize the network for 48k iterations on the Cityscapes dataset with a learning rate of 0.01, momentum of 0.9, weight decay of 10⁻⁴, and a batch size of 1 per GPU. Random scaling between (0.7, 1.3), random horizontal flipping, and a crop size of 1000×2000 are used for data augmentation. We decay the learning rate by a factor of 10 at iterations 36k and 44k.
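The two-stage schedule quoted above amounts to a simple step decay of the learning rate: halve every 20 epochs during Cityscapes pretraining, then halve every 12 epochs during KITTI fine-tuning. A minimal pure-Python sketch of that decay rule (the function name and epoch-indexed convention are illustrative assumptions, not taken from the authors' released code):

```python
def stepped_lr(base_lr, epoch, step, gamma=0.5):
    """Learning rate at a given epoch under step decay:
    multiply base_lr by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Cityscapes pretraining stage (50 epochs): depth LR 2e-4, pose LR 5e-4,
# both halved every 20 epochs, as described in the paper.
depth_lrs = [stepped_lr(2e-4, e, 20) for e in range(50)]
pose_lrs = [stepped_lr(5e-4, e, 20) for e in range(50)]

# KITTI fine-tuning stage (30 epochs): same initial rates,
# halved every 12 epochs.
ft_depth_lrs = [stepped_lr(2e-4, e, 12) for e in range(30)]
```

In PyTorch this corresponds to wrapping each optimizer in `torch.optim.lr_scheduler.StepLR` with `gamma=0.5` and `step_size=20` (pretraining) or `step_size=12` (fine-tuning).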