Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data

Authors: Andrew C. Li, Toryn Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications, including behaviours that never appear in pretraining, while non-compositional approaches fail.
Researcher Affiliation | Academia | Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith; Department of Computer Science, University of Toronto; Vector Institute for Artificial Intelligence; Schwartz Reisman Institute for Technology and Society; Toronto, Canada; EMAIL
Pseudocode | Yes | Algorithm 1: Ground-Compose-Reinforce for RMs; Algorithm 2: Value Iteration over RM States (Modified from Camacho et al. [34]); Algorithm 3: Value Iteration over RM States (modified for self-loop transitions); Algorithm 4: Learning PVF V 3x(s) for x ∈ AP from Trajectory Data D; Algorithm 5: Ground-Compose-Reinforce for RMs with Potential-Based Reward Shaping
Open Source Code | Yes | Code/videos available at: https://github.com/andrewli77/ground-compose-reinforce.
Open Datasets | No | In Geo Grid, D contains 5000 trajectories generated by a random policy. In Drawer World, we manually operated the robot to collect 350 trajectories involving generic behaviours (e.g., opening drawers, lifting boxes).
Dataset Splits | Yes | We tuned hyperparameters via a line search over batch size, learning rate, L1 regularization coefficient, and epochs (in that order) using a held-out 10% of the trajectories in D, and the final hyperparameters are reported in Table 6. Final models were retrained on the full data with the tuned hyperparameters.
Hardware Specification | No | All experiments were run on a compute cluster. Each RL run used a single GPU, 16 CPUs, and 48GB of RAM.
Software Dependencies | No | We used the implementation of PPO at https://github.com/lcswillems/torch-ac with the hyperparameters in Table 7.
Experiment Setup | Yes | Table 6: Hyperparameters for Supervised Training on D. (Geo Grid and Drawer World details follow) Table 7: RL Training Hyperparameters. (Values for all methods provided)
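
The Dataset Splits entry describes a line search over batch size, learning rate, L1 regularization coefficient, and epochs (in that order), scored on a held-out 10% of the trajectories. The following is a minimal sketch of that general procedure, not the authors' code: the function names, the candidate grids, the default values, and the toy validation loss are all hypothetical and stand in for a real train-and-validate step.

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Hyperparams = Dict[str, float]


def split_holdout(data: Sequence, frac: float = 0.1, seed: int = 0) -> Tuple[List, List]:
    """Shuffle the data and hold out a fraction of it for validation."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * frac))
    return items[n_val:], items[:n_val]


def line_search(
    order: Sequence[str],
    grids: Dict[str, Sequence[float]],
    defaults: Hyperparams,
    evaluate: Callable[[Hyperparams], float],
) -> Hyperparams:
    """Tune one hyperparameter at a time, in the given order.

    Each hyperparameter is swept over its grid while the others stay at
    their current best values; the value with the lowest score is kept.
    """
    best = dict(defaults)
    for name in order:
        scored = [(evaluate({**best, name: value}), value) for value in grids[name]]
        best[name] = min(scored, key=lambda pair: pair[0])[1]
    return best


# Hypothetical grids and defaults, chosen only for illustration.
ORDER = ["batch_size", "learning_rate", "l1_coef", "epochs"]
GRIDS = {
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "l1_coef": [1e-3, 1e-4, 1e-5],
    "epochs": [10, 20, 40],
}
DEFAULTS = {"batch_size": 32, "learning_rate": 1e-2, "l1_coef": 1e-3, "epochs": 10}


def toy_validation_loss(hp: Hyperparams) -> float:
    """Stand-in for 'train on the 90% split, return loss on the 10% split'."""
    return (
        abs(hp["batch_size"] - 64) / 64
        + abs(math.log10(hp["learning_rate"]) + 3)
        + abs(math.log10(hp["l1_coef"]) + 4)
        + abs(hp["epochs"] - 20) / 20
    )


train_split, val_split = split_holdout(range(350), frac=0.1)
tuned = line_search(ORDER, GRIDS, DEFAULTS, toy_validation_loss)
# A final model would then be retrained on the full data with `tuned`.
```

A line search of this kind evaluates one grid per hyperparameter rather than the full Cartesian product, so its cost grows with the sum of the grid sizes instead of their product. The trade-off is that the result can depend on the tuning order when hyperparameters interact.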