Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Forecasting in Offline Reinforcement Learning for Non-stationary Environments

Authors: Suzan Ece Ada, Georg Martius, Emre Ugur, Erhan Oztop

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations on offline RL benchmarks, augmented with real-world time-series data to simulate realistic non-stationarity, demonstrate that FORL consistently improves performance compared to competitive baselines.
Researcher Affiliation	Academia	(1) Bogazici University, Türkiye; (2) University of Tübingen, Germany; (3) Ozyegin University, Türkiye; (4) Osaka University, Japan
Pseudocode	Yes	Algorithm 1 Candidate Selection
Open Source Code	No	We plan to provide open access to code in the future.
Open Datasets	Yes	We evaluate FORL across navigation and manipulation tasks in D4RL [15] and OGBench [21] offline RL environments, each augmented with five real-world non-stationarity domains sourced from [22].
Dataset Splits	Yes	Training (Offline Stationary MDP): We begin with an episodic, stationary Markov Decision Process (MDP) $\mathcal{M}_{\text{train}} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \rho_0)$, where the initial state distribution $\rho_0$ is a uniform distribution over the state space $\mathcal{S}$. We only have access to an offline RL dataset $\mathcal{D} = \{(s_t^k)\}$ with $k$ transitions collected from this MDP. Crucially, our FORL diffusion model and a diffusion policy [14] are trained offline using this dataset, such as the standard D4RL benchmark [15], without making any assumptions about how the environment might become non-stationary at test time.
Hardware Specification	No	The numerical calculations reported in this paper were partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).
Software Dependencies	No	The paper does not explicitly provide specific version numbers for key software components such as Python, PyTorch, or CUDA used for their implementation. It only references a library in a citation ([22] Gluon TS) but not its own development stack.
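Experiment Setup	Yes	We use the noise prediction model [11] with the reverse diffusion chain $s_t^{(n-1)}$ formulated as $s_t^{(n-1)} = \frac{s_t^{(n)}}{\sqrt{\alpha^{(n)}}} - \frac{1-\alpha^{(n)}}{\sqrt{\alpha^{(n)}(1-\bar{\alpha}^{(n)})}}\,\epsilon_\theta(s_t^{(n)}, \tau_{(t,w)}, n) + \sqrt{1-\alpha^{(n)}}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ for $n = N, \ldots, 1$, and $\epsilon = 0$ for $n = 1$ [11]. ... and the weighting factors $\alpha^{(n)} = \exp\left(-\left(\beta_{\min}\frac{1}{N} + (\beta_{\max}-\beta_{\min})\frac{2n-1}{2N^2}\right)\right)$, where $\beta_{\max} = 10$ and $\beta_{\min} = 0.1$ are parameters introduced for empirical reasons [19]. ... Results average 5 seeds, unless noted.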
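
The reverse diffusion chain quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation (the code is not yet released). It assumes the standard DDPM-style update and the exponential weighting schedule with βmin = 0.1 and βmax = 10, with the cumulative product of the weighting factors playing the usual role in the denominator. The names `alpha_schedule`, `reverse_step`, `sample`, `eps_model`, and `cond` are hypothetical, and `eps_model` stands in for a trained noise prediction network.

```python
import math
import random
from typing import Callable, List, Sequence

# Hypothetical signature for a noise prediction model: (noisy state, conditioning, step n) -> predicted noise.
EpsModel = Callable[[Sequence[float], object, int], Sequence[float]]


def alpha_schedule(num_steps: int, beta_min: float = 0.1, beta_max: float = 10.0) -> List[float]:
    """Weighting factors alpha^(n) for n = 1..N (assumed exponential schedule)."""
    return [
        math.exp(-(beta_min / num_steps
                   + (beta_max - beta_min) * (2 * n - 1) / (2 * num_steps ** 2)))
        for n in range(1, num_steps + 1)
    ]


def reverse_step(s_n: Sequence[float], n: int, alphas: Sequence[float],
                 alpha_bars: Sequence[float], eps_model: EpsModel,
                 cond: object, rng: random.Random) -> List[float]:
    """One reverse step s^(n) -> s^(n-1); no noise is added at the final step n = 1."""
    a, a_bar = alphas[n - 1], alpha_bars[n - 1]
    eps_hat = eps_model(s_n, cond, n)
    coef = (1.0 - a) / math.sqrt(a * (1.0 - a_bar))
    sigma = math.sqrt(1.0 - a) if n > 1 else 0.0
    return [
        x / math.sqrt(a) - coef * e + sigma * rng.gauss(0.0, 1.0)
        for x, e in zip(s_n, eps_hat)
    ]


def sample(eps_model: EpsModel, cond: object, dim: int, num_steps: int, seed: int = 0) -> List[float]:
    """Run the full chain from Gaussian noise at n = N down to n = 1."""
    rng = random.Random(seed)
    alphas = alpha_schedule(num_steps)
    alpha_bars, running = [], 1.0
    for a in alphas:  # cumulative product of the weighting factors
        running *= a
        alpha_bars.append(running)
    s = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for n in range(num_steps, 0, -1):
        s = reverse_step(s, n, alphas, alpha_bars, eps_model, cond, rng)
    return s
```

With a placeholder model that always predicts zero noise, `sample(lambda s, c, n: [0.0] * len(s), None, dim=3, num_steps=10)` returns a three-element list; a real use would pass a trained network and the conditioning history in place of `None`.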