Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
World-aware Planning Narratives Enhance Large Vision-Language Model Planner
Authors: Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, Xipeng Qiu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5VL achieving a 60.7 absolute improvement in task success rates particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). ... We conduct comprehensive experiments to evaluate our framework's effectiveness in enhancing embodied planning capabilities. Our analysis focuses on both overall performance metrics and fine-grained cognitive capabilities across diverse task contexts. |
| Researcher Affiliation | Academia | Junhao Shi1,2, Zhaoye Fei1*, Siyin Wang1,2, Qipeng Guo2,3, Jingjing Gong2, Xipeng Qiu1,2; 1Fudan University, 2Shanghai Innovation Institute, 3Shanghai AI Laboratory |
| Pseudocode | No | The paper describes its methodology using textual descriptions and a framework overview in Figure 1, but it does not include a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | Our code is provided in supplementary materials and will release soon. |
| Open Datasets | Yes | We construct an enhanced corpus comprising 80,875 instruction-trajectory pairs derived from the original 16,145 ALFRED trajectories... We evaluate on the EB-ALFRED benchmark from EmbodiedBench [25]... We use the dataset from ALFRED (https://github.com/askforalfred/alfred) and evaluation benchmark from EmbodiedBench (https://github.com/EmbodiedBench/EmbodiedBench). |
| Dataset Splits | No | The paper mentions deriving an enhanced corpus from original ALFRED trajectories and evaluating on the EB-ALFRED benchmark, but it does not specify the training, validation, and test splits (e.g., percentages or exact counts) used for its experiments. |
| Hardware Specification | Yes | distributed across 8 A100 80GB nodes via tensor parallelism. ... The complete training cycle requires 14 hours per model variant, aggregating to 800 A100 GPU-hours... Instruction augmentation 4 H100 20 ... Reasoning augmentation 4 H100 200 ... Training 8 H100 100 ... All inference-time results were measured with tensor parallelism on 2 GTX 3090 GPUs. |
| Software Dependencies | No | The paper mentions using AdamW optimization, FlashAttention v2, and BF16 mixed-precision training. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages. |
| Experiment Setup | Yes | The training regime employs AdamW optimization with base learning rate η = 1e-5, 10% linear warmup, and cosine decay scheduling over 3 epochs. Experiments utilize contrastive context windows (16k/32k tokens) with per-device batch size 4, distributed across 8 A100 80GB nodes via tensor parallelism. |
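
The learning-rate schedule quoted under Experiment Setup (base rate 1e-5, 10% linear warmup, cosine decay) can be illustrated with a minimal sketch. This is not the authors' code. The function name `lr_at_step`, the decay floor of zero, the warmup starting near zero, and the step count of 1000 are all assumptions made for illustration; the paper excerpt does not state them.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.10):
    """Learning rate at a given optimizer step (hypothetical helper).

    Linear warmup over the first `warmup_frac` of training, then cosine
    decay. Assumption: the decay ends at zero; the source gives no floor.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # assumed step count, chosen only for the demo
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total))
```

In a real run, `total_steps` would follow from the corpus size, the effective batch size, and the 3 epochs reported; the excerpt does not give enough detail to derive the effective batch size under tensor parallelism, so it is left as a parameter here.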