Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data
Authors: Andrew C. Li, Toryn Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications, including behaviours that never appear in pretraining, while non-compositional approaches fail. |
| Researcher Affiliation | Academia | Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith; Department of Computer Science, University of Toronto; Vector Institute for Artificial Intelligence; Schwartz Reisman Institute for Technology and Society; Toronto, Canada; EMAIL |
| Pseudocode | Yes | Algorithm 1: Ground-Compose-Reinforce for RMs; Algorithm 2: Value Iteration over RM States (modified from Camacho et al. [34]); Algorithm 3: Value Iteration over RM States (modified for self-loop transitions); Algorithm 4: Learning PVF V 3x(s) for x ∈ AP from Trajectory Data D; Algorithm 5: Ground-Compose-Reinforce for RMs with Potential-Based Reward Shaping |
| Open Source Code | Yes | Code/videos available at: https://github.com/andrewli77/ground-compose-reinforce. |
| Open Datasets | No | In Geo Grid, D contains 5000 trajectories generated by a random policy. In Drawer World, we manually operated the robot to collect 350 trajectories involving generic behaviours (e.g., opening drawers, lifting boxes). |
| Dataset Splits | Yes | We tuned hyperparameters via a line search over batch size, learning rate, L1 regularization coefficient, and epochs (in that order) using a held-out 10% of the trajectories in D, and the final hyperparameters are reported in Table 6. Final models were retrained on the full data with the tuned hyperparameters. |
| Hardware Specification | No | All experiments were run on a compute cluster. Each RL run used a single GPU, 16 CPUs, and 48GB of RAM. |
| Software Dependencies | No | We used the implementation of PPO at https://github.com/lcswillems/torch-ac with the hyperparameters in Table 7. |
| Experiment Setup | Yes | Table 6: Hyperparameters for Supervised Training on D. (Geo Grid and Drawer World details follow) Table 7: RL Training Hyperparameters. (Values for all methods provided) |
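
The tuning procedure quoted under Dataset Splits (a line search over batch size, learning rate, L1 regularization coefficient, and epochs, in that order, on a held-out 10% of the trajectories) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the grid values, and the toy `score_fn` are hypothetical, and selecting the value with the lowest held-out score is an assumption, since the quoted text does not name the selection metric.

```python
import random


def split_holdout(trajectories, holdout_frac=0.1, seed=0):
    """Shuffle the trajectories and return (train, held_out)."""
    items = list(trajectories)
    random.Random(seed).shuffle(items)
    n_held = max(1, int(round(len(items) * holdout_frac)))
    return items[n_held:], items[:n_held]


def line_search(grid, defaults, order, score_fn):
    """Tune one hyperparameter at a time, in the given order.

    Each hyperparameter is fixed at its best value before the next one is
    searched. `score_fn(hparams)` is assumed to train on the training split
    and return a held-out score where lower is better.
    """
    best = dict(defaults)
    for name in order:
        scores = {}
        for value in grid[name]:
            candidate = dict(best, **{name: value})
            scores[value] = score_fn(candidate)
        best[name] = min(scores, key=scores.get)
    return best


# Hypothetical grid and a toy stand-in for "train, then evaluate on held-out data".
GRID = {
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "l1_coef": [0.0, 1e-3],
    "epochs": [10, 20],
}
ORDER = ["batch_size", "learning_rate", "l1_coef", "epochs"]
DEFAULTS = {name: values[0] for name, values in GRID.items()}
TOY_OPTIMUM = {"batch_size": 64, "learning_rate": 1e-3, "l1_coef": 1e-3, "epochs": 20}


def toy_score(hparams):
    """Toy held-out loss: squared distance to a made-up optimum."""
    return sum((hparams[k] - TOY_OPTIMUM[k]) ** 2 for k in TOY_OPTIMUM)


train, held_out = split_holdout(range(350))  # 350 trajectories, as in Drawer World
tuned = line_search(GRID, DEFAULTS, ORDER, toy_score)
```

In a real run, `score_fn` would train the model on `train` and evaluate it on `held_out`; per the quoted text, the final models are then retrained on the full data with the tuned hyperparameters.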