Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Zero-Shot Context Generalization in Reinforcement Learning from Few Training Contexts

Authors: James Chapman, Kedar Karhadkar, Guido F. Montufar

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we perform numerical experiments to test CEBE and CSE. For continuous control problems, we use a simple feed-forward network and train with Soft Actor Critic (SAC) (Haarnoja et al., 2018). The inputs to the neural network are states with the context appended on (i.e., (s, c)). We compare against baseline training (i.e., vanilla SAC) and local domain randomization (LDR).
Researcher Affiliation Academia James Chapman Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095 EMAIL Kedar Karhadkar Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095 EMAIL Guido Montúfar Departments of Mathematics and of Statistics & Data Science University of California, Los Angeles Los Angeles, CA 90095 EMAIL
Pseudocode Yes Algorithm 1: Off-policy RL algorithm with context sample enhancement. 1: Given: CMDP M, training contexts Dtrain, data collection iterations N, train iterations M, perturbation radius ϵ, and off-policy RL algorithm ALGO. 2: Initialize policy π, value functions Q, and replay buffer B. 3: Collect some number of trajectories from a random policy in CMDP M(c) with c drawn from Dtrain
Open Source Code Yes Code: https://github.com/chapman20j/ZeroShotGeneralization-CMDPs
Open Datasets Yes We consider the tabular Cliffwalking from gymnasium (Towers et al., 2024), in which an agent must navigate a grid-world to a goal state without first falling off the cliff into a terminal state. For this section, we use the goal-based MuJoCo environments Cheetah Velocity and Ant Direction introduced in the work of Lee and Chung (2021).
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits in the traditional sense. It discusses training with contexts and evaluating on unseen contexts, which is common in RL for generalization, but does not specify fixed data splits from a pre-existing dataset. Data is generated dynamically through environment interaction. For example: "In this paper, we consider the out-of-distribution setting where the train and test context distributions, Dtrain and Dtest, have different supports."
Hardware Specification Yes Hardware: Experiments were run on a system with Intel(R) Xeon(R) Gold 6152 CPUs @ 2.10GHz and NVIDIA GeForce RTX 2080 Ti GPUs.
Software Dependencies Yes Python libraries: Torch (Paszke et al., 2019); Ray (Moritz et al., 2018); Ray Tune (Liaw et al., 2018); Ray RLlib (Liang et al., 2018, 2021); Seaborn (Waskom, 2021); Matplotlib (Hunter, 2007); Pandas (pandas development team, 2020); Numpy (Harris et al., 2020); Scipy (Gommers et al., 2024); Mujoco (Todorov et al., 2012); Gymnasium (Towers et al., 2024).
Experiment Setup Yes We include an overview of the SAC hyperparameters in Table 2 and replay buffer hyperparameters in Table 3. For the SAC experiments, all neural networks use three fully connected layers of width 256 with ReLU activations. The target entropy is automatically tuned by RLlib. The policy polyak averaging coefficient is 5 × 10⁻³. We use a training batch size of 256 and 4000 environment steps per training epoch. For Cart Goal, we also have the following hyperparameters with DQN. The learning rate is 5 × 10⁻⁴ and the train batch size is 32.
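
Two details quoted in the table lend themselves to a short illustration: the network input formed by appending the context to the state, (s, c), and the Polyak averaging step with coefficient 5 × 10⁻³. The following is a minimal sketch in plain Python, not the authors' implementation. The function names `augment_state` and `polyak_update` and the example values are hypothetical, and the update rule shown is the standard soft target update, assumed here rather than taken from the paper.

```python
from typing import List, Sequence

# Coefficient quoted in the Experiment Setup row (5 x 10^-3).
POLYAK_TAU = 5e-3


def augment_state(state: Sequence[float], context: Sequence[float]) -> List[float]:
    """Build the network input (s, c) by appending the context to the state.

    Hypothetical helper: the source only says the context is appended.
    """
    return list(state) + list(context)


def polyak_update(
    target: Sequence[float],
    online: Sequence[float],
    tau: float = POLYAK_TAU,
) -> List[float]:
    """Standard soft update: target <- (1 - tau) * target + tau * online.

    Assumption: parameters are flat lists of floats for illustration only.
    """
    if len(target) != len(online):
        raise ValueError("target and online parameters must have equal length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


# Illustrative values only; they do not come from the paper.
network_input = augment_state([0.1, -0.2, 0.3], [1.5])
new_target = polyak_update([0.0, 1.0], [1.0, 1.0])
```

With a coefficient this small, the target parameters move only 0.5% of the way toward the online parameters per update, which is the usual reason for choosing a value of this size.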