Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints
Authors: Donghao Li, Ruiquan Huang, Cong Shen, Jing Yang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies. |
| Researcher Affiliation | Academia | 1School of EECS, The Pennsylvania State University, University Park, PA, USA 2ECE Department, University of Virginia, Charlottesville, VA, USA. Correspondence to: Jing Yang <yangjing@psu.edu>. |
| Pseudocode | Yes | Algorithm 1 The StepMix Algorithm; Algorithm 2 The EpsMix Algorithm; Algorithm 3 PolicyEva Subroutine |
| Open Source Code | No | The paper does not provide a statement about open-sourcing the code or a link to a code repository. |
| Open Datasets | No | The paper describes generating a 'synthetic environment' for evaluation, rather than using a publicly available dataset with concrete access information. 'We generate a synthetic environment to evaluate the proposed algorithms. We set the number of states S to be 5, the number of actions A for each state to be 5, and the episode length H to be 3. The reward r_h(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel P_h(·|s, a) from an S-dimensional simplex independently and uniformly at random. Such procedure guarantees that the synthetic environment is a proper tabular MDP.' |
| Dataset Splits | No | The paper mentions running '10 trials' and plotting 'the average expected return per episode' but does not specify train/validation/test splits for any dataset. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions 'synthetic experiments'. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers for reproducibility. |
| Experiment Setup | Yes | We adopt the Boltzmann policy (Thrun, 1992) as the baseline policy in our algorithms. Under the Boltzmann policy, actions are taken randomly according to π_h(a|s) = exp{η Q_h(s,a)} / Σ_{a'∈A} exp{η Q_h(s,a')}, where a larger η leads to a more deterministic policy and higher expected value. ... In Figure 1, we track the expected return obtained in each episode with different baseline parameter η and conservative constraint γ. ... We use the baseline Boltzmann policy with η = 10 and η = 15 to collect the offline dataset. The numbers of offline trajectories are set to be 5000 and 8000, respectively. The conservative constraint γ is set to be 2.2. ... We set P_h = P and r_h = r for any h ∈ [H], and randomly generate P and r as in Section 7.1. ... We adopt the Boltzmann policy from Section 7.1 as the baseline policy and set η to be 5. (A minimal code sketch of this setup appears below the table.) |
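
For concreteness, the quoted setup can be reconstructed approximately as follows. This is a minimal sketch assuming only what the table reports (the authors do not release code): the variable names, the use of backward induction to obtain the Q-values that parameterize the Boltzmann baseline, the uniform initial-state distribution, and the offline rollout helper are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch of the synthetic tabular MDP and Boltzmann baseline policy
# described in the paper's experiment setup (Section 7.1). All names and the
# backward-induction step are assumptions; the authors' code is not released.

rng = np.random.default_rng(0)
S, A, H = 5, 5, 3  # number of states, actions per state, episode length

# Rewards r_h(s, a) drawn i.i.d. uniformly from [0, 1].
r = rng.uniform(0.0, 1.0, size=(H, S, A))

# Transition kernels P_h(.|s, a) drawn uniformly from the S-dimensional simplex
# (a Dirichlet(1, ..., 1) sample is uniform on the simplex).
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # shape (H, S, A, S)

def optimal_q(P, r):
    """Backward induction for the Q-values of the finite-horizon tabular MDP."""
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P[h] @ V_next  # (S, A) + (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h].max(axis=1)
    return Q

def boltzmann_policy(Q, eta):
    """pi_h(a|s) proportional to exp(eta * Q_h(s, a)); larger eta is more greedy."""
    logits = eta * Q - (eta * Q).max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # shape (H, S, A)

def collect_trajectories(P, r, pi, n_episodes, rng):
    """Roll out the baseline policy to build an offline dataset of (h, s, a, r, s')."""
    H, S, A, _ = P.shape
    dataset = []
    for _ in range(n_episodes):
        s = rng.integers(S)  # assumed uniform initial-state distribution
        for h in range(H):
            a = rng.choice(A, p=pi[h, s])
            s_next = rng.choice(S, p=P[h, s, a])
            dataset.append((h, s, a, r[h, s, a], s_next))
            s = s_next
    return dataset

Q = optimal_q(P, r)
pi_baseline = boltzmann_policy(Q, eta=10.0)  # e.g., eta = 10 as in the quoted setup
offline_data = collect_trajectories(P, r, pi_baseline, n_episodes=5000, rng=rng)
```

The Dirichlet(1, ..., 1) draw is one standard way to sample transition distributions uniformly from the S-dimensional simplex, matching the quoted description; the conservative exploration algorithms themselves (StepMix/EpsMix) are not reproduced in this sketch.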