Depth Anything V2

Authors: Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Table 2: Zero-shot relative depth estimation. Better: lower Abs Rel, higher δ1. Solely from the metrics, Depth Anything V2 is better than MiDaS, but merely comparable with V1. (The two metrics are defined in the first sketch after this table.)
Researcher Affiliation | Collaboration | ¹HKU ²TikTok; footnote markers in the author list denote the project lead and the corresponding author.
Pseudocode | No | The paper describes its methods in text and uses figures to illustrate concepts, but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | https://depth-anything-v2.github.io
Open Datasets | Yes | As shown in Table 7, we use five precise synthetic datasets (595K images) and eight large-scale pseudo-labeled real datasets (62M images) for training.
Dataset Splits | No | The paper mentions training on specific datasets and evaluating on test sets, but it does not explicitly describe validation dataset splits (e.g., percentages, sample counts, or specific predefined validation sets) used during training or hyperparameter tuning.
Hardware Specification | No | Figure 1 mentions 'latency (V100)' in the context of benchmarking inference speed, but the paper does not explicitly describe the specific hardware (e.g., GPU models, CPUs, or memory) used for training or running the main experiments.
Software Dependencies | No | The paper mentions using 'DPT as our depth decoder', 'DINOv2 encoders', and the 'Adam optimizer', but it does not specify exact version numbers for these software libraries, frameworks, or any other ancillary software dependencies.
Experiment Setup | Yes | Following Depth Anything V1 [89], we use DPT [55] as our depth decoder, built on DINOv2 encoders. All images are trained at a resolution of 518×518 by resizing the shorter side to 518 followed by a random crop. When training the teacher model on synthetic images, we use a batch size of 64 for 160K iterations. In the third stage of training on pseudo-labeled real images, the model is trained with a batch size of 192 for 480K iterations. We use the Adam optimizer and set the learning rates of the encoder and the decoder to 5e-6 and 5e-5, respectively. In both training stages, we do not balance the training datasets, but simply concatenate them. The weight ratio of L_ssi and L_gm is set to 1:2. (A configuration sketch appears after the metric sketch below.)