Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Frank Wang, Fu-En Yang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.
Researcher Affiliation	Industry	1 NVIDIA 2 National Taiwan University EMAIL
Pseudocode	No	The paper describes methods using text and mathematical formulations but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	We plan to release the source code after acceptance.
Open Datasets	Yes	Training Datasets and Evaluation Benchmarks. For SFT cold-start, we fine-tune the MLLM using trajectories from the subset of OXE, and QA tasks from RoboVQA [38], EgoPlan-IT [7], and Video-R1-CoT [12]. During RL training, we incorporate trajectories from the OXE subset and human videos from Something-Something v2 [13]. To enhance general reasoning capability, we include embodied QA datasets such as EgoPlan-IT/Val [7], RoboVQA [38], and the Reflect dataset [26], as well as a general video instruction dataset, i.e., LLaVA-Video-178K [53]. We evaluate ThinkAct on two robot manipulation and three embodied reasoning benchmarks. For manipulation tasks, SimplerEnv [20] containing diverse scenes and LIBERO [24] with long-horizon tasks are evaluated using task success rate. For reasoning benchmarks, EgoPlan-Bench2 [35] uses accuracy on multiple-choice questions, while RoboVQA [38] and OpenEQA [29] are free-form QA tasks evaluated using BLEU score [34] and LLM-based scoring, respectively, following their original protocols.
Dataset Splits	Yes	Specifically, for each task, all methods are evaluated across 500 trials, resulting in a total of 1500 evaluation trials per reported statistic. [...] We fine-tune the action model on just 10 demonstrations per task and evaluate performance over 100 trials.
Hardware Specification	Yes	All experiments are conducted on 16 NVIDIA A100 GPUs with 80 GB memory.
Software Dependencies	No	The paper mentions software like DeepSpeed ZeRO-3 but does not provide specific version numbers for any key software components or libraries like Python, PyTorch, or CUDA.
Experiment Setup	Yes	We initialize Fθ with Qwen2.5-VL 7B [2]. The cold-start stage runs for 20K iterations with batch size 32 and learning rate 1e-5 using DeepSpeed ZeRO-3. We then apply GRPO [39] for 6K iterations, using batch size 64, learning rate 1e-6, and rollout size 5. [...] For reasoning-enhanced action adaptation, we connect the visual plan c_t via a Q-Former [18] as the latent projector with 32 queries and fine-tune on 100K data randomly sampled from the OXE dataset for 120K iterations using batch size 256 and learning rate 2e-5. LIBERO [24] tasks are further fine-tuned for 75K iterations with batch size 128.
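
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch for anyone attempting a reproduction. This is illustrative only: the `StageConfig` class, the stage names, and the `samples_seen` helper are hypothetical and do not come from the paper. The sketch also assumes the reported batch sizes are global and that no gradient accumulation is used, neither of which the excerpt states. The learning rate for the LIBERO stage is left as `None` because the excerpt does not report one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    """One training stage as quoted in the Experiment Setup row (names are hypothetical)."""
    name: str
    iterations: int
    batch_size: int
    learning_rate: Optional[float] = None  # None where the excerpt gives no value
    rollout_size: Optional[int] = None     # only reported for the GRPO stage

    @property
    def samples_seen(self) -> int:
        # Assumption: batch_size is the global batch and there is no
        # gradient accumulation, so samples = iterations * batch_size.
        return self.iterations * self.batch_size


STAGES = [
    StageConfig("sft_cold_start", 20_000, 32, learning_rate=1e-5),
    StageConfig("grpo", 6_000, 64, learning_rate=1e-6, rollout_size=5),
    StageConfig("action_adaptation", 120_000, 256, learning_rate=2e-5),
    StageConfig("libero_finetune", 75_000, 128),  # learning rate not stated
]

if __name__ == "__main__":
    for stage in STAGES:
        print(f"{stage.name}: {stage.samples_seen:,} samples, lr={stage.learning_rate}")
```

Keeping unreported values as `None` rather than guessing a default makes the gaps in the published setup explicit, which is the point of a reproducibility check.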