Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Learning Parameterized Skills from Demonstrations

Authors: Vedant Gupta, Haotian Fu, Calvin Luo, Yiding Jiang, George Konidaris

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate the efficacy of DEPS in learning parameterized skills across two challenging multitask environments: LIBERO [16] and Meta World-v2 [30]. Our primary focus is on the rapid generalization capabilities of the learned skills, assessing their ability to adapt to novel tasks through finetuning with limited data. We demonstrate significant quantitative performance improvements over prior work in low-data regimes and provide qualitative visualizations of learned skills corresponding to fundamental actions like grasping, moving, and releasing objects. We find that DEPS consistently achieves the highest average success rate across various pretraining settings compared to existing methods, underscoring its ability to learn flexible and high-performing skills.
Researcher Affiliation	Academia	1Brown University 2Carnegie Mellon University
Pseudocode	No	The paper describes the DEPS framework and its components in Section 4 and Appendix A, outlining the architecture and flow of information. However, it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor does it present structured steps in a code-like format outside of paragraph text.
Open Source Code	Yes	Website: sites.google.com/view/parameterized-skills Code: github.com/guptbot/DEPS Correspondence: EMAIL
Open Datasets	Yes	We evaluate the efficacy of DEPS in learning parameterized skills across two challenging multitask environments: LIBERO [16] and Meta World-v2 [30].
Dataset Splits	Yes	To pre-train model architectures, we use 80 tasks from LIBERO-90 using the offline dataset provided by Liu et al. [16]. For each of the 80 pretraining tasks, the dataset provides 50 demonstrations collected using human tele-operation. [...] LIBERO-OOD: 10 unseen tasks from LIBERO-90 [...] Each task comes with 50 expert demonstrations. LIBERO-10: The standard LIBERO evaluation dataset, [...] Each task comes with 50 expert demonstrations. LIBERO-3-shot: This dataset consists of the tasks in LIBERO-OOD but with only 3 demonstrations per task, testing the ability to successfully learn new tasks with minimal data. [...] To evaluate performance on Meta World-v2, we utilize the provided expert scripted policies provided in Meta World, collecting 50 demonstration trajectories for each task. [...] We pretrain each method on a set of 10 tasks, performing 40 passes over the training data. We then evaluate the performance of pretrained checkpoints on two different evaluation sets described below (information on the specific tasks used for pretraining and finetuning can be found in Appendix D).
Hardware Specification	Yes	Batch sizes are chosen to fit a single GPU during pretraining and finetuning. This results in a batch size of 3 during pretraining and a batch size of 2 during finetuning for LIBERO, and batch sizes of 8 during pretraining and 3 during finetuning (i.e. all of the data as we do 3-shot finetuning) for Meta World.
Software Dependencies	No	The paper mentions using a 'Resnet-18 image encoder' and 'CNN image encoder' for baselines, and 'LSTM' or 'GRU' for network architectures. However, it does not provide specific version numbers for these software components or any underlying frameworks like PyTorch or TensorFlow, which are necessary for reproducible software dependency information.
Experiment Setup	Yes	We use a learning rate of 3e-4 and the random seeds 95, 96, 97, 98, and 99. Batch sizes are chosen to fit a single GPU during pretraining and finetuning. This results in a batch size of 3 during pretraining and a batch size of 2 during finetuning for LIBERO, and batch sizes of 8 during pretraining and 3 during finetuning (i.e. all of the data as we do 3-shot finetuning) for Meta World. [...] The variational network contains a two-layer bidirectional GRU with a hidden size of 1024. The individual heads for the discrete and continuous parameters both have two layers each, with a hidden size of 1024. [...] In LIBERO we weigh the KL divergence terms corresponding to the discrete skills and continuous variables by 0.5 and 0.01, respectively. The skill parameter norm penalty is weighted by 0.1. For Meta World, we scale these hyperparameter down to 0.1, 0.03, and 0.03 respectively, to account for the lower loss magnitudes (due to using a deterministic policy head instead of a GMM).
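
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which may help when trying to reproduce the runs. The following is a minimal sketch, not the authors' code: the class name `ReportedSetup`, its field names, and the helper `weighted_regularizer` are hypothetical. It also assumes that the learning rate, seeds, and GRU sizes apply to both benchmarks, and that the weighted terms are combined as a plain weighted sum, neither of which the quoted text states explicitly.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Hypothetical container for values quoted in the Experiment Setup row."""
    pretrain_batch_size: int
    finetune_batch_size: int
    kl_weight_discrete: float       # weight on the KL term for discrete skills
    kl_weight_continuous: float     # weight on the KL term for continuous variables
    skill_param_norm_weight: float  # weight on the skill parameter norm penalty
    # Assumed to be shared across benchmarks (the quoted text does not say otherwise).
    learning_rate: float = 3e-4
    seeds: Tuple[int, ...] = (95, 96, 97, 98, 99)
    gru_layers: int = 2
    gru_hidden_size: int = 1024


# Values as quoted for LIBERO.
LIBERO = ReportedSetup(
    pretrain_batch_size=3,
    finetune_batch_size=2,
    kl_weight_discrete=0.5,
    kl_weight_continuous=0.01,
    skill_param_norm_weight=0.1,
)

# Values as quoted for Meta World (weights scaled down to 0.1, 0.03, 0.03).
META_WORLD = ReportedSetup(
    pretrain_batch_size=8,
    finetune_batch_size=3,
    kl_weight_discrete=0.1,
    kl_weight_continuous=0.03,
    skill_param_norm_weight=0.03,
)


def weighted_regularizer(cfg: ReportedSetup,
                         kl_discrete: float,
                         kl_continuous: float,
                         param_norm: float) -> float:
    """Combine the three penalty terms as a weighted sum (an assumption)."""
    return (cfg.kl_weight_discrete * kl_discrete
            + cfg.kl_weight_continuous * kl_continuous
            + cfg.skill_param_norm_weight * param_norm)
```

With this layout, `weighted_regularizer(LIBERO, kl_d, kl_c, norm)` gives the regularization part of a loss for placeholder term values; the actual loss definition should be taken from the paper or the linked repository.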