Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

World-aware Planning Narratives Enhance Large Vision-Language Model Planner

Authors: Junhao Shi, Zhaoye Fei, Siyin Wang, Qipeng Guo, Jingjing Gong, Xipeng Qiu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Evaluations on the EB-ALFRED benchmark demonstrate substantial improvements, with Qwen2.5VL achieving a 60.7 absolute improvement in task success rates particularly in commonsense reasoning (+60.0) and long-horizon planning (+70.0). ... We conduct comprehensive experiments to evaluate our framework's effectiveness in enhancing embodied planning capabilities. Our analysis focuses on both overall performance metrics and fine-grained cognitive capabilities across diverse task contexts.
Researcher Affiliation	Academia	Junhao Shi1,2, Zhaoye Fei1*, Siyin Wang1,2, Qipeng Guo2,3, Jingjing Gong2, Xipeng Qiu1,2; 1Fudan University, 2Shanghai Innovation Institute, 3Shanghai AI Laboratory
Pseudocode	No	The paper describes its methodology using textual descriptions and a framework overview in Figure 1, but it does not include a clearly labeled pseudocode block or algorithm.
Open Source Code	Yes	Our code is provided in supplementary materials and will release soon.
Open Datasets	Yes	We construct an enhanced corpus comprising 80,875 instruction-trajectory pairs derived from the original 16,145 ALFRED trajectories... We evaluate on the EB-ALFRED benchmark from EmbodiedBench [25]... We use the dataset from ALFRED (https://github.com/askforalfred/alfred) and evaluation benchmark from EmbodiedBench (https://github.com/EmbodiedBench/EmbodiedBench).
Dataset Splits	No	The paper mentions deriving an enhanced corpus from original ALFRED trajectories and evaluating on the EB-ALFRED benchmark, but it does not specify the training, validation, and test splits (e.g., percentages or exact counts) used for its experiments.
Hardware Specification	Yes	distributed across 8 A100 80GB nodes via tensor parallelism. ... The complete training cycle requires 14 hours per model variant, aggregating to 800 A100 GPU-hours... Instruction augmentation 4 H100 20 ... Reasoning augmentation 4 H100 200 ... Training 8 H100 100 ... All inference-time results were measured with tensor parallelism on 2 GTX 3090 GPUs.
Software Dependencies	No	The paper mentions using AdamW optimization, Flash Attention v2, and BF16 mixed-precision training. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages.
Experiment Setup	Yes	The training regime employs AdamW optimization with base learning rate η = 1e-5, 10% linear warmup, and cosine decay scheduling over 3 epochs. Experiments utilize contrastive context windows (16k/32k tokens) with per-device batch size 4, distributed across 8 A100 80GB nodes via tensor parallelism.
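
The learning-rate schedule quoted under Experiment Setup (base rate 1e-5, 10% linear warmup, then cosine decay) can be illustrated with a short sketch. This is not the authors' code. The function name `lr_at_step`, the total step count, and the choice to decay all the way to zero are assumptions made for illustration; the paper excerpt does not state a final learning rate or the number of optimizer steps.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-5, warmup_frac=0.10, min_lr=0.0):
    """Return the learning rate at a given optimizer step.

    base_lr and warmup_frac follow the values quoted in the table.
    min_lr=0.0 and total_steps are assumptions, not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    warmup_steps = int(total_steps * warmup_frac)

    # Linear warmup: ramp from base_lr / warmup_steps up to base_lr.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps

    # Cosine decay: fall smoothly from base_lr to min_lr over the remaining steps.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps across the 3 epochs
    for s in (0, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000 total steps, the rate climbs linearly over the first 100 steps, peaks at 1e-5, and then follows a half cosine down to the assumed floor. In practice, the real step count would depend on the corpus size, the per-device batch size, and the parallelism layout, none of which are fully specified in the excerpt.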