Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity

Authors: Laixi Shi, Yuejie Chi

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on the gambler's problem (Sutton and Barto, 2018; Zhou et al., 2021) to evaluate the performance of the proposed algorithm DRVI-LCB, with comparisons to the robust value iteration algorithm DRVI without pessimism (Panaganti and Kalathil, 2022). Our code can be accessed at: https://github.com/Laixishi/Robust-RL-with-KL-divergence. ... Figure 1(a) plots the sub-optimality value gap ... Figure 1(b) shows the sub-optimality gap ... Figure 1(c) illustrates the ratio of winning ... Figure 1(d) shows that DRVI-LCB performs consistently better than DRVI ... Figure 2 shows the sub-optimality value gap with respect to the number of trajectories K...
Researcher Affiliation | Academia | Laixi Shi, Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA; Yuejie Chi, Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Pseudocode | Yes | Algorithm 1: Two-fold subsampling trick for the finite-horizon setting. ... Algorithm 2: Robust value iteration with LCB (DRVI-LCB) for robust offline RL. ... Algorithm 3: Robust value iteration with LCB (DRVI-LCB) for infinite-horizon RMDPs.
Open Source Code | Yes | Our code can be accessed at: https://github.com/Laixishi/Robust-RL-with-KL-divergence.
Open Datasets | Yes | We conduct experiments on the gambler's problem (Sutton and Barto, 2018; Zhou et al., 2021) to evaluate the performance of the proposed algorithm DRVI-LCB
Dataset Splits | No | The paper describes generating its own data from the simulated environment rather than using predefined training/validation/test splits.
Hardware Specification | No | The paper does not provide any specific hardware details used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | Gambler's problem. ... with a state space S = {0, 1, ..., 50} and the associated possible actions a ∈ {0, 1, ..., min{s, 50 − s}} at state s. Here, we set the horizon length H = 100. ... We evaluate the performance of the learned policy π̂ using our proposed method DRVI-LCB with comparison to DRVI without pessimism, where we fix the uncertainty level σ = 0.1 for learning the robust optimal policy. ... Figure 1(b) shows the sub-optimality gap V^{⋆,σ}_1(ρ) − V^{π̂,σ}_1(ρ) with varying sample sizes N = 100, 300, 1000, 3000, 5000...
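The quoted setup pins down the gambler's-problem MDP concretely (states {0, ..., 50}, stakes up to min{s, 50 − s}). A minimal tabular construction is sketched below; the head probability of 0.4 and the stay-put convention for illegal stakes are illustrative assumptions, as the quoted text states neither.

```python
import numpy as np

def gamblers_problem(goal=50, p_head=0.4):
    """Tabular MDP for the gambler's problem described in the setup:
    states s in {0, ..., goal}; legal stakes a in {0, ..., min(s, goal - s)};
    a head (prob. p_head) moves s -> s + a, a tail moves s -> s - a.
    NOTE: p_head = 0.4 and the self-loop convention for illegal stakes
    are assumptions of this sketch, not taken from the quoted text."""
    S = goal + 1
    A = goal // 2 + 1  # the largest legal stake is goal // 2 (at s = goal // 2)
    P = np.zeros((S, A, S))   # transition kernel P[s, a, s']
    r = np.zeros((S, A))      # expected immediate reward
    for s in range(S):
        max_stake = min(s, goal - s)
        for a in range(max_stake + 1):
            P[s, a, s + a] += p_head
            P[s, a, s - a] += 1.0 - p_head
            if s < goal and s + a == goal:
                r[s, a] = p_head  # reward 1 for reaching the goal, in expectation
        for a in range(max_stake + 1, A):
            P[s, a, s] = 1.0      # illegal stakes: stay put
    return P, r
```

An offline dataset of K trajectories can then be drawn by rolling a behavior policy through `P` for H = 100 steps, matching the sample-size sweep quoted above.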
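The pseudocode row names DRVI-LCB: robust value iteration with a lower-confidence-bound penalty over a KL uncertainty set. The recipe can be sketched using the standard dual form of the KL-robust expectation; the grid search over the dual variable, the penalty array `b`, and the clipping at zero are assumptions of this sketch, not a transcription of Algorithm 2 in the paper.

```python
import numpy as np

def kl_robust_expectation(p_hat, v, sigma, lam_grid=np.logspace(-3, 3, 200)):
    """Worst-case expectation of v over a KL ball of radius sigma around the
    empirical distribution p_hat, via the usual dual form
        inf_{KL(P || p_hat) <= sigma} E_P[v]
          = sup_{lam >= 0} { -lam * log E_{p_hat}[exp(-v / lam)] - lam * sigma },
    solved here by a crude grid search over the dual variable (illustrative)."""
    best = v[p_hat > 0].min()  # lam -> 0 limit: worst value on the support
    for lam in lam_grid:
        m = (-v / lam).max()   # log-sum-exp stabilization
        log_mgf = m + np.log(p_hat @ np.exp(-v / lam - m))
        best = max(best, -lam * log_mgf - lam * sigma)
    return best

def drvi_lcb(P_hat, r, sigma, b):
    """Finite-horizon robust value iteration with an LCB-style penalty b[h, s, a].
    P_hat: empirical transitions (S, A, S); r: rewards (S, A). The penalty
    schedule b and the clipping at zero are assumptions of this sketch."""
    H, S, A = b.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):  # backward induction over the horizon
        Q = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                robust = kl_robust_expectation(P_hat[s, a], V[h + 1], sigma)
                Q[s, a] = max(0.0, r[s, a] + robust - b[h, s, a])
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi
```

Subtracting the penalty before clipping is what makes the estimate pessimistic: actions whose empirical robust value is uncertain are discounted, which is the mechanism the comparison with penalty-free DRVI isolates.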