VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation
Authors: Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that VLN-VIDEO significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset. |
| Researcher Affiliation | Collaboration | Jialu Li¹,²*, Aishwarya Padmakumar², Gaurav Sukhatme², Mohit Bansal¹ (¹University of North Carolina, Chapel Hill; ²Amazon Alexa AI). {jialuli, mbansal}@cs.unc.edu, {padmakua, sukhatme}@amazon.com |
| Pseudocode | No | The paper describes the methods in text and provides an overview diagram (Figure 1), but it does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any statement about releasing source code or a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our agent on the Touchdown dataset (Chen et al. 2019). The Manh50 dataset (Zhu et al. 2021) we use during pre-training is extracted from the Street Learn dataset (Mirowski et al. 2019)... The driving videos we utilized during pre-training come from the BDD100K dataset (Chen et al. 2018). |
| Dataset Splits | Yes | Touchdown is set in Manhattan and contains 9,326 instruction-trajectory pairs, with 6,526 examples in the training set, 1,391 examples in the validation set, and 1,409 examples in the test set. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU model, CPU type, memory) used to run the experiments. |
| Software Dependencies | Yes | We utilize a Mask-RCNN model (He et al. 2017) from the Detectron2 (Wu et al. 2019) package pre-trained on the LVIS dataset (Gupta, Dollar, and Girshick 2019) to detect objects in video frames. ...we additionally compare to using pre-trained BERT-base embeddings (Devlin et al. 2018) for fair comparison. (A loading sketch for this detector is shown below the table.) |
| Experiment Setup | No | The paper describes the pre-training tasks (Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction) and the fine-tuning process, but it does not provide specific hyperparameter values (e.g., learning rates, batch sizes, number of epochs) or detailed training configurations. |
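
The Detectron2 dependency quoted in the Software Dependencies row can be illustrated with a minimal sketch. The paper only states that a Mask-RCNN pre-trained on LVIS detects objects in video frames; the specific backbone config, score threshold, and frame path below are assumptions for illustration, not the authors' code.

```python
# Hedged sketch: load a Detectron2 Mask-RCNN pre-trained on LVIS and run it on
# one driving-video frame. Config choice and threshold are assumptions.
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Assumed model-zoo config; the paper does not specify the backbone.
CONFIG = "LVISv0.5-InstanceSegmentation/mask_rcnn_X_101_32x8d_FPN_1x.yaml"

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold
predictor = DefaultPredictor(cfg)

frame = cv2.imread("frame_000001.jpg")  # hypothetical BDD100K video frame
outputs = predictor(frame)
instances = outputs["instances"].to("cpu")

# Predicted class ids index into the LVIS category set; boxes localize the
# detected objects in the frame.
print(instances.pred_classes, instances.pred_boxes)
```

Detections like these would supply the per-frame object information the paper describes extracting from driving videos during pre-training; the downstream use of the detections is not shown here.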