Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints

Authors: Donghao Li, Ruiquan Huang, Cong Shen, Jing Yang

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies.
Researcher Affiliation | Academia | (1) School of EECS, The Pennsylvania State University, University Park, PA, USA; (2) ECE Department, University of Virginia, Charlottesville, VA, USA. Correspondence to: Jing Yang <yangjing@psu.edu>.
Pseudocode | Yes | Algorithm 1: The StepMix Algorithm; Algorithm 2: The StepMix Algorithm; Algorithm 3: Policy Evaluation Subroutine
Open Source Code | No | The paper does not provide a statement about open-sourcing the code or a link to a code repository.
Open Datasets | No | The paper describes generating a 'synthetic environment' for evaluation, rather than using a publicly available dataset with concrete access information (a minimal reconstruction of this environment is sketched after the table): 'We generate a synthetic environment to evaluate the proposed algorithms. We set the number of states S to be 5, the number of actions A for each state to be 5, and the episode length H to be 3. The reward r_h(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel P_h(·|s, a) from an S-dimensional simplex independently and uniformly at random. Such procedure guarantees that the synthetic environment is a proper tabular MDP.'
Dataset Splits | No | The paper mentions running '10 trials' and plotting 'the average expected return per episode' but does not specify train/validation/test splits for any dataset.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions 'synthetic experiments'.
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers for reproducibility.
Experiment Setup | Yes | We adopt the Boltzmann policy (Thrun, 1992) as the baseline policy in our algorithms. Under the Boltzmann policy, actions are taken randomly according to π_h(a|s) = exp{η Q_h(s,a)} / Σ_{a′∈A} exp{η Q_h(s,a′)}, where a larger η leads to a more deterministic policy and higher expected value. ... In Figure 1, we track the expected return obtained in each episode with different baseline parameter η and conservative constraint γ. ... We use the baseline Boltzmann policy with η = 10 and η = 15 to collect the offline dataset. The numbers of offline trajectories are set to be 5000 and 8000, respectively. The conservative constraint γ is set to be 2.2. ... We set P_h = P and r_h = r for any h ∈ [H], and randomly generate P and r as in Section 7.1. ... We adopt the Boltzmann policy from Section 7.1 as the baseline policy and set η to be 5. (A sketch of this baseline policy and its evaluation follows the table.)
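
The environment description quoted above (S = 5 states, A = 5 actions, horizon H = 3, rewards drawn uniformly from [0, 1], transition kernels drawn uniformly from the S-dimensional simplex) pins down the generation procedure, so a minimal sketch is possible. The snippet below is not the authors' code; the NumPy calls, array shapes, and the fixed seed are illustrative assumptions.

```python
# Sketch (not the authors' code): reconstructing the synthetic tabular MDP
# described in the paper: S = 5, A = 5, H = 3, rewards r_h(s, a) ~ Uniform[0, 1],
# and transition kernels P_h(.|s, a) drawn uniformly from the S-dimensional simplex.
import numpy as np

rng = np.random.default_rng(0)  # seed is an assumption; the paper reports 10 trials
S, A, H = 5, 5, 3

# Rewards: independent Uniform[0, 1] for every (h, s, a).
r = rng.uniform(0.0, 1.0, size=(H, S, A))

# Transitions: Dirichlet(1, ..., 1) is the uniform distribution on the simplex.
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # shape (H, S, A, S)

# Sanity check: each P_h(.|s, a) is a valid probability distribution.
assert np.allclose(P.sum(axis=-1), 1.0)
```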
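The Boltzmann baseline π_h(a|s) ∝ exp{η Q_h(s,a)} and a standard backward-induction evaluation of its expected return can likewise be sketched. The helper names (boltzmann_policy, evaluate, optimal_Q) and the choice of Q-values used to define the baseline (here the optimal Q-function of the generated MDP) are assumptions for illustration; the paper does not specify how the baseline's Q-values are computed.

```python
# Sketch continuing from the synthetic-MDP snippet above (uses P, r).
import numpy as np

def boltzmann_policy(Q, eta):
    """pi[h, s, a] = exp(eta * Q[h, s, a]) / sum_a' exp(eta * Q[h, s, a'])."""
    logits = eta * Q
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=-1, keepdims=True)

def evaluate(pi, P, r):
    """Expected return V_1^pi of a stochastic policy via backward induction."""
    H, S, A = r.shape
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q_h = r[h] + P[h] @ V            # Q_h[s, a] = r_h(s, a) + <P_h(.|s, a), V_{h+1}>
        V = (pi[h] * Q_h).sum(axis=-1)   # V_h[s] = sum_a pi_h(a|s) * Q_h(s, a)
    return V

def optimal_Q(P, r):
    """Optimal Q-function by dynamic programming (assumed source of baseline Q-values)."""
    H, S, A = r.shape
    Q, V = np.zeros((H, S, A)), np.zeros(S)
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V
        V = Q[h].max(axis=-1)
    return Q

eta = 10.0  # larger eta gives a more deterministic baseline with higher expected value
pi_b = boltzmann_policy(optimal_Q(P, r), eta)
print(evaluate(pi_b, P, r))  # baseline expected return from each initial state
```

Under this construction, sweeping η (e.g., 5, 10, 15 as in the experiments) trades off the stochasticity of the baseline against its expected return, which is what the conservative constraint γ is measured against.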