Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale
Authors: Botao Hao, Rahul Jain, Dengwang Tang, Zheng Wen
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that the proposed iRLSVI algorithm is able to achieve significant reduction in regret as compared to two baselines: no offline data, and offline dataset but used without suitably modeling the generative policy. [...] We run the experiments on Deep Sea for T = 300 episodes, and run the experiments on Maze for T = 200 episodes. For both environments, the empirical cumulative regrets are averaged over 50 simulations. The experimental results are illustrated in Figure 2 and 3, as well as Figure 5 in Appendix D.2 and Figure 6 in Appendix D.3. |
| Researcher Affiliation | Collaboration | Botao Hao (Google DeepMind); Rahul Jain (University of Southern California, Google DeepMind); Dengwang Tang (University of Southern California); Zheng Wen (Google DeepMind) |
| Pseudocode | Yes | Algorithm 1 iPSRL — Input: prior µ₀, initial state distribution ν; for t = 1, ..., T do (A1) Update posterior µ_t(θ \| H_{t−1}, D₀) using Bayes rule ... Algorithm 2 RLSVI agents for numerical experiments — Input: algorithm parameters σ₀², σ² > 0, deliberateness parameter β > 0, offline dataset D₀, offline buffer size B, agent type agent ... Algorithm 3 sample Q̂_t — Input: algorithm parameters σ₀², σ² > 0, deliberateness parameter β > 0, offline dataset D₀, data buffer D, offline buffer size B, agent type agent |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the methodology described is publicly available. |
| Open Datasets | Yes | Maze (Figure 1 (ii)) is also an episodic reinforcement learning problem, which is a variant of a maze problem proposed in D4RL (Fu et al., 2020). |
| Dataset Splits | No | The paper describes how the offline dataset (D₀) is generated and sized using parameters like κ, and specifies the number of online interaction episodes T, but it does not define explicit train/validation/test splits for a static dataset in the traditional sense. |
| Hardware Specification | No | The paper describes the experimental setup, including the environments and number of episodes, but does not provide specific details about the hardware (e.g., GPU/CPU models) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks like Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | Experimental setup. We now present some empirical results in two prototypical environments: Deep Sea and Maze. Specifically, we compare three variants of the RLSVI agents, which are respectively referred to as informed RLSVI (iRLSVI), partially informed RLSVI (piRLSVI), and uninformed RLSVI (uRLSVI). All three agents are tabular RLSVI agents with similar posterior sampling-type exploration schemes. [...] We run the experiments on Deep Sea for T = 300 episodes, and run the experiments on Maze for T = 200 episodes. For both environments, the empirical cumulative regrets are averaged over 50 simulations. [...] In all experiments in this paper, we choose the algorithm parameters σ₀² = 1, σ² = 0.1, and B = 20. [...] we assume β(s) = β (a constant) across all states. We set the size of the offline dataset D₀ as \|D₀\| = κ\|A\|\|S\|, where κ ≥ 0 is referred to as data ratio. |
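To make the quoted setup concrete, the snippet below sketches how a tabular RLSVI-style agent samples a randomized Q-function: each Q(s, a) entry is drawn from a Gaussian posterior whose prior variance is σ₀² = 1 and whose noise variance is σ² = 0.1, matching the parameters reported above. This is an illustrative sketch only, not the authors' implementation; the function name `sample_q_tabular`, its argument names, and the simplification of pooling transitions across timesteps are assumptions made here for clarity.

```python
import numpy as np

def sample_q_tabular(transitions, n_states, n_actions, horizon,
                     sigma0_sq=1.0, sigma_sq=0.1, rng=None):
    """Sample a randomized Q-function in the spirit of tabular RLSVI.

    transitions: list of (s, a, r, s_next) tuples, with s_next=None at
    episode termination. For each (s, a), the Q-value at each step h is
    drawn from the Gaussian posterior of a scalar Bayesian regression
    with prior N(0, sigma0_sq) and observation noise sigma_sq.
    (Sketch assumption: one pooled dataset is reused at every step h.)
    """
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((horizon + 1, n_states, n_actions))  # Q[horizon] = 0
    for h in reversed(range(horizon)):  # backward induction
        for s in range(n_states):
            for a in range(n_actions):
                # Regression targets: r + max_a' Q_{h+1}(s', a')
                y = [r + (0.0 if sn is None else Q[h + 1, sn].max())
                     for (ss, aa, r, sn) in transitions
                     if ss == s and aa == a]
                n = len(y)
                # Gaussian posterior for a scalar mean under a N(0, sigma0_sq) prior
                post_var = 1.0 / (1.0 / sigma0_sq + n / sigma_sq)
                post_mean = post_var * (sum(y) / sigma_sq)
                # Posterior sampling: draw a perturbed Q-value
                Q[h, s, a] = rng.normal(post_mean, np.sqrt(post_var))
    return Q
```

With no data for a given (s, a), the draw falls back to the N(0, σ₀²) prior, which is what drives optimistic exploration; as data accumulates, the posterior variance shrinks toward σ²/n and the sampled values concentrate.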