Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

Authors: Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	RDD outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io. 4 Experiments
Researcher Affiliation	Collaboration	1 University of California, Riverside; 2 University of Michigan; 3 Meta AI
Pseudocode	Yes	A.1 Dynamic Programming Solver to Problem 3.1. Algorithm 1 shows the dynamic programming solver. Lmax and Lmin are user-specified parameters that determine the minimum and maximum length of proposed sub-task intervals. J is the interval scoring function. Algorithm 1 Max Sum Partition. Require: Sequence u = [u1, u2, ..., un], scoring function J, integer Lmin, integer Lmax. Ensure: Maximum score sum and partition of u. 1: Initialize dp[0 ... n], parts[0 ... n] 2: dp[0] ← 0 3: for i = Lmin + 1 to n do 4: bestScore ← −∞ 5: bestPartition ← ∅ 6: for j = 0 to i do 7: if Lmin ≤ i − j ≤ Lmax then 8: segment ← u[j : i] 9: s ← J(segment) (can be evaluated in parallel before loops) 10: if dp[j] + s > bestScore then 11: bestScore ← dp[j] + s 12: bestPartition ← parts[j] ∪ {segment} 13: end if 14: end if 15: end for 16: if bestPartition ≠ ∅ then 17: dp[i] ← bestScore 18: parts[i] ← bestPartition 19: else 20: dp[i] ← dp[i − 1] 21: parts[i] ← parts[i − 1] 22: end if 23: end for 24: return (dp[n], parts[n])
Open Source Code	Yes	Code and more results are available at rdd-neurips.github.io.
Open Datasets	Yes	We evaluate RDD on the RLBench [32] robot manipulation benchmark. The visuomotor policy training set D^train_aug is adapted from [13]. D^train originally consists of 1908 teleoperated demonstrations from RLBench's training set. We first evaluate RDD on the real-world manipulation benchmark AgiBot World Alpha [33]. For OOD sub-tasks, we test RDD on the human-operated demonstration dataset from RoboCerebra [34], which features highly diverse demonstrations in terms of objects, task goals, and arrangements.
Dataset Splits	Yes	The finetuning dataset D^demo_aug is built on RLBench's validation set following the same procedure except that the decomposition strategy is replaced by RDD. Each task has three demonstrations. For OOD sub-tasks, we test RDD on the human-operated demonstration dataset from RoboCerebra [34] ... We use 560 demos to build the RDD database and test on the remaining 140 demos.
Hardware Specification	Yes	The finetuning process takes about 5 minutes with 4 NVIDIA 6000 Ada GPUs. We test the running time of Algorithm 1 with different numbers of frames on AMD EPYC 9254 using one CPU core. Performance of FAISS nearest neighbor search and RDD time on NVIDIA 4090.
Software Dependencies	No	We adopt RACER [13] as the base hierarchical VLA framework, which uses RVT [39] as the low-level visuomotor policy πθ and the recent LLaVa-based VLM llama3-llava-next-8B [40] as the pre-trained base model for planner pϕ. We use the pre-trained RVT policy πθ provided by RACER [13], trained on D^train_aug and the validation set of RLBench (labeled with the same decomposition rule as in D^train_aug). During the deployment phase, the planner is finetuned for two epochs on D^demo_aug using LoRA [41], with a rank of 128 and a scaling factor of 256, following RACER.
Experiment Setup	Yes	The finetuning process takes about 5 minutes with 4 NVIDIA 6000 Ada GPUs. For base parameter settings, we set the weighting factor α = 1 and interval similarity measure sim in Eq. 3.6 for non-OOD scenarios, and use LIV [26] as the visual encoder E that is specifically designed for manipulation tasks. We use Gemini-1.5-flash [42] to generate sub-task language instructions for proposed intervals in D^demo_aug.
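
The Max Sum Partition procedure quoted in the Pseudocode row can be illustrated with a short Python sketch. This is not the authors' code: the function name `max_sum_partition`, the example scorer `homogeneity_score`, and the use of half-open index pairs `(j, i)` to represent segments are all assumptions made for illustration. The sketch is also a stricter variant than the quoted algorithm: it requires every element to be covered by a segment and returns negative infinity when no such partition exists, rather than carrying the previous prefix forward as in steps 19 to 21.

```python
import math
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[int, int]  # half-open interval [start, end) into the sequence


def max_sum_partition(
    u: Sequence,
    score: Callable[[Sequence], float],
    l_min: int,
    l_max: int,
) -> Tuple[float, List[Segment]]:
    """Split u into contiguous segments of length l_min..l_max with maximal total score."""
    n = len(u)
    # dp[i] is the best total score for the prefix u[:i]; -inf means "not reachable".
    dp = [-math.inf] * (n + 1)
    parts: List[List[Segment]] = [[] for _ in range(n + 1)]
    dp[0] = 0.0

    for i in range(1, n + 1):
        best_score = -math.inf
        best_partition = None
        # Only start indices j with l_min <= i - j <= l_max can close a segment at i.
        for j in range(max(0, i - l_max), i - l_min + 1):
            if dp[j] == -math.inf:
                continue  # the prefix u[:j] has no valid partition
            s = score(u[j:i])
            if dp[j] + s > best_score:
                best_score = dp[j] + s
                best_partition = parts[j] + [(j, i)]
        if best_partition is not None:
            dp[i] = best_score
            parts[i] = best_partition

    return dp[n], parts[n]


def homogeneity_score(segment: Sequence) -> float:
    """Toy scorer (hypothetical): reward long segments whose elements are all equal."""
    return float(len(segment) ** 2) if len(set(segment)) == 1 else 0.0


frames = [1, 1, 1, 5, 5, 5, 5]
total, segments = max_sum_partition(frames, homogeneity_score, l_min=2, l_max=5)
print(total, segments)  # 25.0 [(0, 3), (3, 7)]
```

With the toy scorer, the best split separates the run of 1s from the run of 5s. In the paper's setting the scorer J would instead rate how well a candidate interval matches retrieved sub-task demonstrations, and, as the quoted pseudocode notes, those scores can be computed in parallel before the loops run.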