Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Online Reinforcement Learning in Non-Stationary Context-Driven Environments
Authors: Pouya Hamadanian, Arash Nasr-Esfahany, Malte Schwarzkopf, Siddhartha Sen, Mohammad Alizadeh
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate LCPO in Mujoco, classic control and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting... |
| Researcher Affiliation | Collaboration | Pouya Hamadanian (MIT CSAIL), Arash Nasr-Esfahany (MIT CSAIL), Malte Schwarzkopf (CS, Brown University), Siddhartha Sen (Microsoft Research), Mohammad Alizadeh (MIT CSAIL) |
| Pseudocode | Yes | Algorithm 1 LCPO Training. 1: initialize parameter vectors θ0, empty buffer Ba; 2: for each iteration do; 3: Br ← Sample a mini-batch of new interactions; 4: Sc ← W(Ba, Br); 5: v ← ∇θ Ltot(θ; Br)\|θ0; 6: if Sc is not empty then; 7: g(x) := ∇θ(x^T ∇θ DKL(θold, θ; Sc)\|θ0)\|θ0; 8: vc ← conjgrad(v, g(·)); 9: while θold + vc violates constraints do; 10: vc ← vc/2; 11: θ0 ← θ0 + vc; 12: else; 13: θ0 ← θ0 + v; 14: Ba ← Ba + Br |
| Open Source Code | Yes | LCPO's source code is available at https://github.com/pouyahmdn/LCPO. |
| Open Datasets | Yes | We consider six environments: Modified versions of Pendulum-v1 from the classic control environments, Inverted Pendulum-v4, Inverted Double Pendulum-v4, Hopper-v4 and Reacher-v4 from the Mujoco environments (Towers et al., 2023), and a straggler mitigation environment (Hamadanian et al., 2022). |
| Dataset Splits | No | The paper describes experimental procedures and data generation (e.g., 'warm-up period of 6 million time steps', 'Context traces 1 and 2 are 20 million...'), but does not specify explicit training/test/validation dataset splits typically found in static dataset evaluations. In online reinforcement learning, data is generated sequentially, rather than being pre-split. |
| Hardware Specification | Yes | These experiments were conducted on a machine with 2 AMD EPYC 7763 CPUs (256 logical cores) and 512 GiB of RAM. With 32 concurrent runs, experiments finished in 1152 hours. |
| Software Dependencies | Yes | We use Gymnasium (v0.29.1, MIT license) and Mujoco (v3.1.1, Apache-2.0 license). Our baseline and LCPO implementations use the Pytorch (Paszke et al., 2019) (v1.13.1, BSD-style license) library. |
| Experiment Setup | Yes | Table 11 is a comprehensive list of all hyperparameters used in training and the environment. |
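
The update step in the Algorithm 1 pseudocode quoted above (steps 5 to 13) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation, which lives in the linked repository. The names `conjgrad`, `lcpo_style_step`, `hvp`, and `violates` are hypothetical. The function `g(·)` is modeled as a Hessian-vector product with a fixed toy matrix instead of an autodiff KL term, and the constraint check is a simple quadratic KL proxy.

```python
from typing import Callable, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha: float, x: Vector, y: Vector) -> Vector:
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjgrad(v: Vector, hvp: Callable[[Vector], Vector],
             iters: int = 10, tol: float = 1e-10) -> Vector:
    """Approximately solve H x = v, given only the map x -> H x (step 8)."""
    x = [0.0] * len(v)
    r = list(v)  # residual v - H x, with x = 0 initially
    p = list(v)  # current search direction
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        hp = hvp(p)
        alpha = rs / dot(p, hp)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, hp, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)
        rs = rs_new
    return x


def lcpo_style_step(theta: Vector, v: Vector,
                    hvp: Optional[Callable[[Vector], Vector]],
                    violates: Callable[[Vector], bool],
                    max_halvings: int = 20) -> Vector:
    """One parameter update following steps 6 to 13 of the quoted pseudocode."""
    if hvp is None:
        # Sc is empty: plain step along v (step 13).
        return axpy(1.0, v, theta)
    vc = conjgrad(v, hvp)
    for _ in range(max_halvings):
        candidate = axpy(1.0, vc, theta)
        if not violates(candidate):
            return candidate
        vc = [0.5 * c for c in vc]  # backtrack: halve the step (step 10)
    return list(theta)  # assumption: keep old parameters if no step is accepted


# Toy usage: H is an assumed fixed diagonal curvature matrix diag(2, 4).
H_DIAG = [2.0, 4.0]
toy_hvp = lambda x: [h * xi for h, xi in zip(H_DIAG, x)]
theta_old = [0.0, 0.0]


def toy_violates(candidate: Vector, delta: float = 0.5) -> bool:
    # Quadratic KL proxy: 0.5 * d^T H d must stay below delta.
    d = [c - t for c, t in zip(candidate, theta_old)]
    return 0.5 * dot(d, toy_hvp(d)) > delta


theta_new = lcpo_style_step(theta_old, [2.0, 4.0], toy_hvp, toy_violates)
```

The cap on halvings is a design choice of this sketch, so that the backtracking loop always terminates. The pseudocode itself specifies an unbounded `while` loop.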