Semantic Visual Navigation by Watching YouTube Videos
Authors: Matthew Chang, Arjun Gupta, Saurabh Gupta
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show results on the Object Goal task in novel environments [3]. Our experiments test the extent to which we are able to learn semantic cues for navigation by watching videos, and how this compares to alternate techniques for learning such cues via direct interaction. |
| Researcher Affiliation | Academia | Matthew Chang, Arjun Gupta, Saurabh Gupta, University of Illinois at Urbana-Champaign. {mc48, arjung2, saurabhg}@illinois.edu |
| Pseudocode | No | The paper describes procedural steps and equations (e.g., Q-learning form), but it does not contain any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Project website with code, models, and videos: https://matthewchang.github.io/value-learning-from-videos/. |
| Open Datasets | Yes | We use the Habitat simulator [52] with the Gibson environments [68] (100 training environments from the medium split, and the 5 validation environments from the tiny split). |
| Dataset Splits | Yes | We split the 105 environments into three sets: Etrain, Etest, and Evideo with 15, 5, and 85 environments respectively. |
| Hardware Specification | No | The paper describes the robot model and its sensors, but it does not specify the hardware (e.g., GPU, CPU models, or memory) used for training the models or running the simulations. |
| Software Dependencies | No | The paper mentions various software components and algorithms used, such as 'ResNet-18', 'Mask R-CNN', 'Habitat simulator', 'PPO', 'Double DQN', and 'Adam', but it does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | Inverse model ψ processes RGB images I_t and I_{t+1} using a ResNet-18 model [29], stacks the resulting convolutional feature maps, and further processes them using 2 convolutional layers and 2 fully connected layers to obtain the final prediction for the intervening action. We use Double DQN ... with Adam [34] for training the Q-networks, and set γ = 0.99. As our reward is bounded between 0 and 1, clipping the target value between 0 and 1 led to more stable training. |
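
The Experiment Setup row describes the inverse model ψ only at the block level (ResNet-18 features for I_t and I_{t+1}, stacked, then 2 convolutional and 2 fully connected layers predicting the intervening action). The following is a minimal PyTorch sketch of that structure; the layer widths, kernel sizes, input resolution, and the number of discrete actions are assumptions, not values taken from the paper.

```python
# Sketch of the inverse model psi: ResNet-18 features of two consecutive frames,
# stacked along channels, then 2 conv layers and 2 fully connected layers that
# predict the intervening action. Widths and num_actions are assumed, not from the paper.
import torch
import torch.nn as nn
import torchvision.models as models


class InverseModel(nn.Module):
    def __init__(self, num_actions: int = 4):  # num_actions is an assumption
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep the convolutional trunk, drop the average pool and classifier head.
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, H', W')
        self.conv = nn.Sequential(
            nn.Conv2d(1024, 256, kernel_size=3, padding=1),  # stacked maps: 512 + 512 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Sequential(
            nn.Linear(64, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, num_actions),  # logits over intervening actions
        )

    def forward(self, img_t: torch.Tensor, img_t1: torch.Tensor) -> torch.Tensor:
        f_t = self.encoder(img_t)          # features of I_t
        f_t1 = self.encoder(img_t1)        # features of I_{t+1}
        x = torch.cat([f_t, f_t1], dim=1)  # stack the feature maps along channels
        x = self.conv(x).flatten(1)
        return self.fc(x)


# Usage sketch: predict the action between two 224x224 RGB frames.
model = InverseModel(num_actions=4)
logits = model(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
action = logits.argmax(dim=1)
```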
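
The same row also states that the Q-networks are trained with Double DQN and Adam, γ = 0.99, and that the bootstrapped target is clipped to [0, 1] because the reward is bounded in that range. Below is a hedged sketch of such an update; the network interfaces, replay-batch contents, loss choice, and learning rate are assumptions for illustration only.

```python
# Double DQN target with gamma = 0.99 and the target clipped to [0, 1],
# mirroring the training details quoted in the Experiment Setup row.
# q_net, target_net, and the replay batch layout are assumed placeholders.
import torch
import torch.nn.functional as F

GAMMA = 0.99


def double_dqn_loss(q_net, target_net, batch):
    obs, actions, rewards, next_obs, done = batch  # tensors sampled from a replay buffer

    # Q(s, a) for the actions actually taken.
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Double DQN: the online network selects the greedy action,
        # the target network evaluates it.
        next_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)
        target = rewards + GAMMA * (1.0 - done.float()) * next_q
        # Reward is bounded in [0, 1], so clip the target for more stable training.
        target = target.clamp(0.0, 1.0)

    return F.smooth_l1_loss(q_values, target)


# Training step sketch (Adam as stated in the paper; the learning rate is assumed):
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
# loss = double_dqn_loss(q_net, target_net, replay.sample(batch_size))
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```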