Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity
Authors: Laixi Shi, Yuejie Chi
JMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the gambler's problem (Sutton and Barto, 2018; Zhou et al., 2021) to evaluate the performance of the proposed algorithm DRVI-LCB, with comparisons to the robust value iteration algorithm DRVI without pessimism (Panaganti and Kalathil, 2022). Our code can be accessed at: https://github.com/Laixishi/Robust-RL-with-KL-divergence. ... Figure 1(a) plots the sub-optimality value gap ... Figure 1(b) shows the sub-optimality gap ... Figure 1(c) illustrates the ratio of winning ... Figure 1(d) shows that DRVI-LCB performs consistently better than DRVI ... Figure 2 shows the sub-optimality value gap with respect to the number of trajectories K... |
| Researcher Affiliation | Academia | Laixi Shi, Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA; Yuejie Chi, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA |
| Pseudocode | Yes | Algorithm 1: Two-fold subsampling trick for the finite-horizon setting. ... Algorithm 2: Robust value iteration with LCB (DRVI-LCB) for robust offline RL. ... Algorithm 3: Robust value iteration with LCB (DRVI-LCB) for infinite-horizon RMDPs. |
| Open Source Code | Yes | Our code can be accessed at: https://github.com/Laixishi/Robust-RL-with-KL-divergence. |
| Open Datasets | Yes | We conduct experiments on the gambler's problem (Sutton and Barto, 2018; Zhou et al., 2021) to evaluate the performance of the proposed algorithm DRVI-LCB |
| Dataset Splits | No | The paper describes generating its own data from a simulator (the gambler's problem) and does not report any train/validation/test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | Gambler's problem. ... with a state space S = {0, 1, ..., 50} and the associated possible actions a ∈ {0, 1, ..., min{s, 50 − s}} at state s. Here, we set the horizon length H = 100. ... We evaluate the performance of the learned policy π̂ using our proposed method DRVI-LCB with comparison to DRVI without pessimism, where we fix the uncertainty level σ = 0.1 for learning the robust optimal policy. ... Figure 1(b) shows the sub-optimality gap V_1^{*,σ}(ρ) − V_1^{π̂,σ}(ρ) with varying sample sizes N = 100, 300, 1000, 3000, 5000... |
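The quoted setup (states {0, ..., 50}, stakes a ∈ {0, ..., min{s, 50 − s}}, horizon H = 100) can be sketched as a plain finite-horizon value iteration on the gambler's problem. This is an illustrative sketch only: the head probability `p_head` and the reach-the-target reward convention are assumptions taken from the Sutton and Barto formulation, and the paper's DRVI-LCB additionally uses a KL-based robust Bellman operator with a data-driven lower-confidence-bound penalty, which is not reproduced here.

```python
import numpy as np

def gambler_value_iteration(target=50, horizon=100, p_head=0.4):
    """Finite-horizon value iteration for the gambler's problem.

    State s is the gambler's capital; at state s the allowed stakes are
    a in {0, ..., min(s, target - s)}.  With probability p_head the
    gambler gains the stake, otherwise loses it.  V[h, s] is the
    probability of reaching `target` from s with h steps remaining.
    """
    V = np.zeros((horizon + 1, target + 1))
    V[:, target] = 1.0  # reaching the target counts as a win
    for h in range(1, horizon + 1):
        for s in range(1, target):
            # action a = 0 (no bet) keeps the current value, so the
            # max below is over a non-empty set
            V[h, s] = max(
                p_head * V[h - 1, s + a] + (1 - p_head) * V[h - 1, s - a]
                for a in range(0, min(s, target - s) + 1)
            )
    return V[horizon]

v = gambler_value_iteration()
```

Here `v[s]` estimates the optimal win probability from capital `s` within the horizon; for example, betting everything at `s = 25` wins immediately with probability `p_head`, so `v[25]` is at least 0.4 under this sketch.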