Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale
Authors: Botao Hao, Rahul Jain, Dengwang Tang, Zheng Wen
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that the proposed iRLSVI algorithm is able to achieve significant reduction in regret as compared to two baselines: no offline data, and offline dataset but used without suitably modeling the generative policy. [...] We run the experiments on Deep Sea for T = 300 episodes, and run the experiments on Maze for T = 200 episodes. For both environments, the empirical cumulative regrets are averaged over 50 simulations. The experimental results are illustrated in Figure 2 and 3, as well as Figure 5 in Appendix D.2 and Figure 6 in Appendix D.3. |
| Researcher Affiliation | Collaboration | Botao Hao (Google DeepMind); Rahul Jain (University of Southern California, Google DeepMind); Dengwang Tang (University of Southern California); Zheng Wen (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1 iPSRL — Input: prior µ₀, initial state distribution ν; for t = 1, ..., T do (A1) Update posterior µ_t(θ \| H_{t−1}, D₀) using Bayes rule ... Algorithm 2 RLSVI agents for numerical experiments — Input: algorithm parameters σ₀², σ² > 0, deliberateness parameter β > 0, offline dataset D₀, offline buffer size B, agent type agent ... Algorithm 3 sample Q̂_t — Input: algorithm parameters σ₀², σ² > 0, deliberateness parameter β > 0, offline dataset D₀, data buffer D, offline buffer size B, agent type agent |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | Maze (Figure 1 (ii)) is also an episodic reinforcement learning problem, which is a variant of a maze problem proposed in D4RL (Fu et al., 2020). |
| Dataset Splits | No | The paper describes how the offline dataset (D₀) is generated and sized using parameters like κ, and specifies the number of online interaction episodes T, but it does not define explicit train/validation/test splits for a static dataset in the traditional sense. |
| Hardware Specification | No | The paper describes the experimental setup, including the environments and number of episodes, but does not provide specific details about the hardware (e.g., GPU/CPU models) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks like Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | Experimental setup. We now present some empirical results in two prototypical environments: Deep Sea and Maze. Specifically, we compare three variants of the RLSVI agents, which are respectively referred to as informed RLSVI (iRLSVI), partially informed RLSVI (piRLSVI), and uninformed RLSVI (uRLSVI). All three agents are tabular RLSVI agents with similar posterior sampling-type exploration schemes. [...] We run the experiments on Deep Sea for T = 300 episodes, and run the experiments on Maze for T = 200 episodes. For both environments, the empirical cumulative regrets are averaged over 50 simulations. [...] In all experiments in this paper, we choose the algorithm parameters σ₀² = 1, σ² = 0.1, and B = 20. [...] we assume β(s) = β (a constant) across all states. We set the size of the offline dataset D₀ as \|D₀\| = κ\|A\|\|S\|, where κ ≥ 0 is referred to as data ratio. |
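To make the quoted setup concrete, the snippet below sketches how a tabular RLSVI-style agent samples a randomized Q-function: each Q(s, a) entry is drawn from a Gaussian posterior whose prior variance is σ₀² = 1 and whose noise variance is σ² = 0.1, matching the parameters reported above. This is an illustrative sketch only, not the authors' implementation; the function name `sample_q_tabular`, its argument names, and the simplification of pooling transitions across timesteps are assumptions made here for clarity.

```python
import numpy as np

def sample_q_tabular(transitions, n_states, n_actions, horizon,
                     sigma0_sq=1.0, sigma_sq=0.1, rng=None):
    """Sample a randomized Q-function in the spirit of tabular RLSVI.

    transitions: list of (s, a, r, s_next) tuples, with s_next=None at
    episode termination. For each (s, a), the Q-value at each step h is
    drawn from the Gaussian posterior of a scalar Bayesian regression
    with prior N(0, sigma0_sq) and observation noise sigma_sq.
    (Sketch assumption: one pooled dataset is reused at every step h.)
    """
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((horizon + 1, n_states, n_actions))  # Q[horizon] = 0
    for h in reversed(range(horizon)):  # backward induction
        for s in range(n_states):
            for a in range(n_actions):
                # Regression targets: r + max_a' Q_{h+1}(s', a')
                y = [r + (0.0 if sn is None else Q[h + 1, sn].max())
                     for (ss, aa, r, sn) in transitions
                     if ss == s and aa == a]
                n = len(y)
                # Gaussian posterior for a scalar mean under a N(0, sigma0_sq) prior
                post_var = 1.0 / (1.0 / sigma0_sq + n / sigma_sq)
                post_mean = post_var * (sum(y) / sigma_sq)
                # Posterior sampling: draw a perturbed Q-value
                Q[h, s, a] = rng.normal(post_mean, np.sqrt(post_var))
    return Q
```

With no data for a given (s, a), the draw falls back to the N(0, σ₀²) prior, which is what drives optimistic exploration; as data accumulates, the posterior variance shrinks toward σ²/n and the sampled values concentrate.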