Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Online Reinforcement Learning in Non-Stationary Context-Driven Environments

Authors: Pouya Hamadanian, Arash Nasr-Esfahany, Malte Schwarzkopf, Siddhartha Sen, Mohammad Alizadeh

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate LCPO in Mujoco, classic control and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting...
Researcher Affiliation Collaboration Pouya Hamadanian MIT CSAIL EMAIL Arash Nasr-Esfahany MIT CSAIL EMAIL Malte Schwarzkopf CS Brown University EMAIL Siddhartha Sen Microsoft Research EMAIL Mohammad Alizadeh MIT CSAIL EMAIL
Pseudocode Yes Algorithm 1 LCPO Training 1: initialize parameter vectors θ0, empty buffer Ba 2: for each iteration do 3: Br ← Sample a mini-batch of new interactions 4: Sc ← W(Ba, Br) 5: v ← ∇θ Ltot(θ; Br)|θ0 6: if Sc is not empty then 7: g(x) := ∇θ(xᵀ ∇θ DKL(θold, θ; Sc)|θ0)|θ0 8: vc ← conjgrad(v, g(·)) 9: while θold + vc violates constraints do 10: vc ← vc/2 11: θ0 ← θ0 + vc 12: else 13: θ0 ← θ0 + v 14: Ba ← Ba + Br
Open Source Code Yes LCPO's source code is available at https://github.com/pouyahmdn/LCPO.
Open Datasets Yes We consider six environments: Modified versions of Pendulum-v1 from the classic control environments, Inverted Pendulum-v4, Inverted Double Pendulum-v4, Hopper-v4 and Reacher-v4 from the Mujoco environments (Towers et al., 2023), and a straggler mitigation environment (Hamadanian et al., 2022).
Dataset Splits No The paper describes experimental procedures and data generation (e.g., 'warm-up period of 6 million time steps', 'Context traces 1 and 2 are 20 million...'), but does not specify explicit training/test/validation dataset splits typically found in static dataset evaluations. In online reinforcement learning, data is generated sequentially, rather than being pre-split.
Hardware Specification Yes These experiments were conducted on a machine with 2 AMD EPYC 7763 CPUs (256 logical cores) and 512 GiB of RAM. With 32 concurrent runs, experiments finished in 1152 hours.
Software Dependencies Yes We use Gymnasium (v0.29.1, MIT license) and Mujoco (v3.1.1, Apache-2.0 license). Our baseline and LCPO implementations use the Pytorch (Paszke et al., 2019) (v1.13.1, BSD-style license) library.
Experiment Setup Yes Table 11 is a comprehensive list of all hyperparameters used in training and the environment.
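
The update step quoted in the Pseudocode row (steps 5 to 13 of Algorithm 1) can be sketched roughly as follows. This is a minimal illustration in pure Python, not the authors' implementation, which is in the linked repository and uses PyTorch. The names `conjgrad_solve`, `constrained_update`, `demo_hvp` and `demo_violates` are hypothetical, and the diagonal matrix and threshold of 1.0 in the demo are made-up stand-ins for the KL Hessian-vector product g(·) and the constraint check.

```python
from typing import Callable, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y, element-wise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjgrad_solve(v: Vector, hvp: Callable[[Vector], Vector],
                   iters: int = 10, tol: float = 1e-10) -> Vector:
    """Approximately solve H x = v using only products hvp(x) = H x."""
    x = [0.0] * len(v)
    r = list(v)          # residual v - H x, with x = 0
    p = list(v)          # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(p)
        alpha = rs / dot(p, hp)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, hp, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)   # p = r + (rs_new / rs) * p
        rs = rs_new
    return x


def constrained_update(theta: Vector, v: Vector,
                       hvp: Optional[Callable[[Vector], Vector]],
                       violates: Callable[[Vector, Vector], bool],
                       max_halvings: int = 20) -> Vector:
    """One update in the spirit of Algorithm 1, steps 6 to 13.

    hvp is None when there are no anchor samples (Sc is empty).
    max_halvings is an assumption added here to guarantee termination.
    """
    if hvp is None:
        return axpy(1.0, v, theta)            # step 13: plain step
    vc = conjgrad_solve(v, hvp)               # step 8
    for _ in range(max_halvings):
        if not violates(theta, vc):           # step 9
            return axpy(1.0, vc, theta)       # step 11
        vc = [0.5 * c for c in vc]            # step 10: halve the step
    return list(theta)                        # no acceptable step found


# Toy stand-ins (assumptions): H = diag(2, 4), quadratic constraint <= 1.0.
def demo_hvp(x: Vector) -> Vector:
    return [2.0 * x[0], 4.0 * x[1]]


def demo_violates(theta: Vector, step: Vector) -> bool:
    return 0.5 * dot(step, demo_hvp(step)) > 1.0


new_theta = constrained_update([0.0, 0.0], [2.0, 4.0], demo_hvp, demo_violates)
print(new_theta)
```

In the toy run, the conjugate-gradient solve proposes a step that exceeds the stand-in constraint, so the step is halved once before being applied. Passing `None` for `hvp` exercises the unconstrained branch.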