Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
PanoGen: Text-Conditioned Panoramic Environment Generation for Vision-and-Language Navigation
Authors: Jialu Li, Mohit Bansal
NeurIPS 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, learning with our PANOGEN environments achieves the new state-of-the-art on the Room-to-Room, Room-for-Room, and CVDN datasets. |
| Researcher Affiliation | Academia | Jialu Li Mohit Bansal UNC Chapel Hill EMAIL |
| Pseudocode | No | The paper describes the PANOGEN method using textual descriptions and figures but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website URL (https://pano-gen.github.io) but does not explicitly state that source code for the described methodology is released or provide a direct link to a code repository within the paper's text. |
| Open Datasets | Yes | We evaluate our agent on three datasets: Room-to-Room dataset (R2R) [2], Cooperative Vision-and-Dialog Navigation dataset (CVDN) [52], and Room-for-Room dataset (R4R) [21]. |
| Dataset Splits | Yes | The training set contains 61 different room environments, while the unseen validation set and test set contains 11, and 18 room environments that are unseen during training. |
| Hardware Specification | Yes | It takes 2 days on 6 A100s to generate all the environments. ... We train the speaker for 4 epochs on one A6000 GPU... We train the model on one A6000 GPU. |
| Software Dependencies | Yes | We caption all the view images in the training environments in R2R dataset with BLIP-2-FlanT5-xxL. We utilize stable-diffusion-v2.1 base model to generate the single view based on caption only, and use stable-diffusion-v1.5-inpainting model to outpaint the unseen observation for the rotated views. ... We build our speaker model based on mPLUG-base. |
| Experiment Setup | Yes | We train the speaker for 4 epochs on one A6000 GPU with batch size 16 for two days. ... We pre-train the agent with batch size 64 for 150k iterations, and then fine-tune the agent with batch size 8 for 40k iterations. |