PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
Authors: Jialu Li, Mohit Bansal
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, learning with our PANOGEN environments achieves the new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets. |
| Researcher Affiliation | Academia | Jialu Li, Mohit Bansal, UNC Chapel Hill, {jialuli, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper describes the PANOGEN method using textual descriptions and figures but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website URL (https://pano-gen.github.io) but does not explicitly state that source code for the described methodology is released or provide a direct link to a code repository within the paper's text. |
| Open Datasets | Yes | We evaluate our agent on three datasets: Room-to-Room dataset (R2R) [2], Cooperative Vision-and-Dialog Navigation dataset (CVDN) [52], and Room-for-Room dataset (R4R) [21]. |
| Dataset Splits | Yes | The training set contains 61 different room environments, while the unseen validation set and test set contain 11 and 18 room environments, respectively, that are unseen during training. |
| Hardware Specification | Yes | It takes 2 days on 6 A100s to generate all the environments. ... We train the speaker for 4 epochs on one A6000 GPU... We train the model on one A6000 GPU. |
| Software Dependencies | Yes | We caption all the view images in the training environments in R2R dataset with BLIP-2-FlanT5-xxL. We utilize stable-diffusion-v2.1 base model to generate the single view based on caption only, and use stable-diffusion-v1.5-inpainting model to outpaint the unseen observation for the rotated views. ... We build our speaker model based on mPLUG-base. (A hedged sketch of this caption-generate-outpaint chain appears below the table.) |
| Experiment Setup | Yes | We train the speaker for 4 epochs on one A6000 GPU with batch size 16 for two days. ... We pre-train the agent with batch size 64 for 150k iterations, and then fine-tune the agent with batch size 8 for 40k iterations. (Restated as a config sketch below the table.) |
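
The Software Dependencies row names a three-stage chain: BLIP-2 captions the training views, stable-diffusion-v2.1-base generates a view from a caption, and stable-diffusion-v1.5-inpainting outpaints the unseen regions of rotated views. Below is a minimal sketch of that chain, not the authors' released code: the Hugging Face model IDs, prompt handling, file names, and the placeholder rotation/mask step are assumptions for illustration (the inpainting checkpoint in particular may have moved on the Hub).

```python
# Minimal sketch (assumptions noted inline) of the caption -> generate ->
# outpaint chain described in the Software Dependencies row.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Caption a training-environment view with BLIP-2 (FlanT5-XXL variant).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl", torch_dtype=torch.float16
).to(device)
view = Image.open("view.jpg")  # hypothetical input view image
inputs = processor(images=view, return_tensors="pt").to(device, torch.float16)
caption = processor.decode(
    captioner.generate(**inputs, max_new_tokens=40)[0], skip_special_tokens=True
)

# 2) Generate a single view from the caption with stable-diffusion-v2.1-base.
sd = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
).to(device)
first_view = sd(prompt=caption).images[0]

# 3) Outpaint the unseen observation of a rotated view with the
#    v1.5 inpainting model (white mask pixels = regions to fill).
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to(device)
rotated = first_view  # placeholder: would be the rotated crop of the panorama
mask = Image.new("L", rotated.size, 255)  # placeholder mask over unseen pixels
next_view = inpaint(prompt=caption, image=rotated, mask_image=mask).images[0]
```

In the paper, step 3 is applied recursively as the view rotates around the panorama, so each outpainted view conditions the next; the sketch shows a single rotation step only.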
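
For quick reference, the compute and training hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be restated as a plain config; the key names are assumptions, the values come verbatim from the paper's text.

```python
# Illustrative restatement of the quoted settings; key names are assumptions.
PANORAMA_GENERATION = {"hardware": "6x A100", "wall_clock": "2 days"}
SPEAKER_TRAINING = {
    "epochs": 4,
    "batch_size": 16,
    "hardware": "1x A6000",
    "wall_clock": "2 days",
}
AGENT_TRAINING = {
    "pretrain": {"batch_size": 64, "iterations": 150_000},
    "finetune": {"batch_size": 8, "iterations": 40_000},
    "hardware": "1x A6000",
}
```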