Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Improving planning and MBRL with temporally-extended actions

Authors: Palash Chatterjee, Roni Khardon

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An extensive experimental evaluation both in planning and in MBRL, shows that our approach yields faster planning, better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
Researcher Affiliation | Academia | Palash Chatterjee Indiana University EMAIL Roni Khardon Indiana University EMAIL
Pseudocode | Yes | Algorithm 1 One decision step: Action selection using a temporally-extended dynamics function (F) with a generic shooting-based planner; Algorithm 2 Using ATE with a fixed δt_max as planner for MBRL
Open Source Code | Yes | The code can be found at https://github.com/pecey/MBRL-with-TEA/
Open Datasets | Yes | For experiments, we use the Mountain Car environment from Gymnasium [Kwiatkowski et al., 2024], a multi-hill Mountain Car environment from the Probabilistic and Reinforcement Learning Track of the International Planning Competition (IPC) 2023 [Taitler et al., 2024], and the Dubins car environment from Chatterjee et al. [2023]. Then, we experiment in the MBRL setting where T and R are not known. In this case, we learn a temporally-extended model as discussed above, by interacting with the environment. We use Cartpole from Gymnasium and Ant, Half Cheetah, Hopper, Reacher, Pusher and Walker from MuJoCo [Todorov et al., 2012].
Dataset Splits | No | We collect data by interacting with the environment and use the data to train F̂_TE to predict a distribution over the next states and a point estimate for the reward. The data is collected dynamically by interacting with the environment, and models are trained on this collected data, rather than being split from a static dataset.
Hardware Specification | Yes | Each MBRL experiment was performed on a single node with a single GPU (using a mix of V100 and A100), single CPU (AMD EPYC 7742) and 64GB of RAM.
Software Dependencies | No | The environment specifications along with the necessary hyperparameters are described in the Appendix. However, specific version numbers for software dependencies like Gymnasium or MuJoCo are not provided.
Experiment Setup | Yes | Appendix E Experimental Details; Table A2: Hyper-parameters for different environments. Range of action duration is for the fixed variant of the algorithm. Model learning: For MBRL, we use the same model architecture as Chua et al. [2018]. We learn an ensemble of 5 models, where each model is a fully connected neural network. For Ant, Half Cheetah, and Hopper, the model has 4 hidden layers, while for all other environments it has 3 hidden layers. Each hidden layer has 200 neurons. The learning rates for each environment are given in Table A2.
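
The model sizes quoted in the Experiment Setup entry can be made concrete with a short sketch. The Python below is not taken from the authors' repository; it only encodes the numbers stated above (an ensemble of 5 fully connected networks, 200 neurons per hidden layer, 4 hidden layers for Ant, Half Cheetah, and Hopper, and 3 otherwise). The class name EnsembleConfig, the input layout (state, action, and one scalar action duration), and the output layout (a mean and a variance per state dimension plus one scalar reward) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Environments that the summary says use 4 hidden layers; all others use 3.
FOUR_LAYER_ENVS = {"Ant", "Half Cheetah", "Hopper"}


@dataclass(frozen=True)
class EnsembleConfig:
    """Hypothetical container for the model sizes quoted in the summary."""

    env_name: str
    state_dim: int          # assumption: supplied by the caller
    action_dim: int         # assumption: supplied by the caller
    ensemble_size: int = 5  # "an ensemble of 5 models"
    hidden_width: int = 200 # "each hidden layer has 200 neurons"

    @property
    def num_hidden_layers(self) -> int:
        return 4 if self.env_name in FOUR_LAYER_ENVS else 3

    def layer_sizes(self) -> List[int]:
        # Assumed input layout: state, action, and one scalar action duration.
        in_dim = self.state_dim + self.action_dim + 1
        # Assumed output layout: mean and variance per state dimension,
        # plus a scalar point estimate of the reward.
        out_dim = 2 * self.state_dim + 1
        hidden = [self.hidden_width] * self.num_hidden_layers
        return [in_dim] + hidden + [out_dim]

    def params_per_member(self) -> int:
        # Weights plus biases for each fully connected layer.
        sizes = self.layer_sizes()
        return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

    def total_params(self) -> int:
        return self.ensemble_size * self.params_per_member()


if __name__ == "__main__":
    # The dimensions here are placeholders, not values from the paper.
    cfg = EnsembleConfig("Hopper", state_dim=11, action_dim=3)
    print(cfg.layer_sizes(), cfg.total_params())
```

The state and action dimensions passed in the usage lines are placeholders; the real values depend on the environment version, which the summary notes is not pinned.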