Depth from a Single Image by Harmonizing Overcomplete Local Network Predictions
Authors: Ayan Chakrabarti, Jingyu Shao, Gregory Shakhnarovich
NeurIPS 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on the NYUv2 depth data set [11], and find that it achieves state-of-the-art performance. Table 2 reports the quantitative performance of our method, along with other state-of-the-art approaches over the entire test set, and we find that the proposed method yields superior performance on most metrics. |
| Researcher Affiliation | Academia | Ayan Chakrabarti, TTI-Chicago, Chicago, IL, ayanc@ttic.edu; Jingyu Shao, Dept. of Statistics, UCLA, Los Angeles, CA, shaojy15@ucla.edu; Gregory Shakhnarovich, TTI-Chicago, Chicago, IL, gregory@ttic.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code for the implementation, along with a pre-trained network model, is available at http://www.ttic.edu/chakrabarti/mdepth. |
| Open Datasets | Yes | We train and evaluate our method on the NYU v2 depth dataset [11]. |
| Dataset Splits | Yes | To construct our training and validation sets, we adopt the standard practice of using the raw videos corresponding to the training images from the official train/test split. We randomly select 10% of these videos for validation, and use the rest for training our network. Our training set is formed by sub-sampling video frames uniformly, and consists of roughly 56,000 color image-depth map pairs. (An illustrative split sketch follows the table.) |
| Hardware Specification | Yes | Our overall inference method (network predictions and globalization) takes 24 seconds per-image when using an NVIDIA Titan X GPU. AC and GS thank NVIDIA Corporation for donations of Titan X GPUs used in this research. |
| Software Dependencies | No | The paper mentions using the VGG-19 network and ReLU activations, but does not provide specific version numbers for programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We use a fully convolutional version of our architecture during training with a stride of 8 pixels, yielding nearly 4000 training patches per image. We train the network using SGD for a total of 14 epochs, using a batch size of only one image and a momentum value of 0.9. We begin with a learning rate of 0.01, and reduce it after the 4th, 8th, 10th, 12th, and 13th epochs, each time by a factor of two. (An illustrative training-schedule sketch follows the table.) |
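
The Dataset Splits row describes a video-level 90/10 train/validation split of the raw NYUv2 training videos, followed by uniform frame subsampling to roughly 56,000 image-depth pairs. The Python sketch below shows one way such a split could be implemented; the function names, the `video_ids`/`frames_per_video` inputs, and the fixed subsampling step are hypothetical and not taken from the authors' released code.

```python
import random

def split_nyu_videos(video_ids, val_fraction=0.1, seed=0):
    """Randomly hold out a fraction of raw NYUv2 training videos for validation.

    `video_ids` is a hypothetical list of identifiers for the raw videos that
    correspond to the official NYUv2 training images.
    """
    rng = random.Random(seed)
    shuffled = list(video_ids)
    rng.shuffle(shuffled)
    n_val = int(round(val_fraction * len(shuffled)))
    return shuffled[n_val:], shuffled[:n_val]  # (train_videos, val_videos)

def subsample_frames(frames_per_video, step):
    """Uniformly subsample frames from each training video.

    `frames_per_video` maps a video id to its ordered list of frame indices;
    `step` would be chosen so the pooled set has roughly 56,000 RGB-D pairs.
    """
    return {vid: frames[::step] for vid, frames in frames_per_video.items()}
```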
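
The Experiment Setup row specifies an SGD schedule: 14 epochs, a batch size of one image, momentum 0.9, and an initial learning rate of 0.01 halved after the 4th, 8th, 10th, 12th, and 13th epochs. The following is a minimal sketch of that schedule assuming a PyTorch-style optimizer rather than the paper's original framework; the single convolution standing in for the paper's VGG-19-based fully convolutional network, the dummy data loader, and the mean-squared-error placeholder loss are all assumptions for illustration only.

```python
import torch

# Hypothetical stand-in for the paper's fully convolutional VGG-19-based network.
model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Halve the learning rate after the 4th, 8th, 10th, 12th, and 13th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[4, 8, 10, 12, 13], gamma=0.5)

# Dummy single-image batches; the real loader would yield ~56,000 image/depth pairs.
dummy_loader = [(torch.randn(1, 3, 64, 64), torch.randn(1, 64, 64, 64))
                for _ in range(4)]

for epoch in range(14):
    for image, target in dummy_loader:  # batch size of one image
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(image), target)  # placeholder loss
        loss.backward()
        optimizer.step()
    scheduler.step()
```

The `MultiStepLR` milestones with `gamma=0.5` reproduce the stated "reduce by a factor of two" steps when `scheduler.step()` is called once per epoch.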