Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints
Authors: Donghao Li, Ruiquan Huang, Cong Shen, Jing Yang
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies. |
| Researcher Affiliation | Academia | 1School of EECS, The Pennsylvania State University, University Park, PA, USA 2ECE Department, University of Virginia, Charlottesville, VA, USA. Correspondence to: Jing Yang <yangjing@psu.edu>. |
| Pseudocode | Yes | Algorithm 1 The StepMix Algorithm; Algorithm 2 The EpsMix Algorithm; Algorithm 3 PolicyEva Subroutine |
| Open Source Code | No | The paper does not provide a statement about open-sourcing the code or a link to a code repository. |
| Open Datasets | No | The paper describes generating a 'synthetic environment' for evaluation, rather than using a publicly available dataset with concrete access information. 'We generate a synthetic environment to evaluate the proposed algorithms. We set the number of states S to be 5, the number of actions A for each state to be 5, and the episode length H to be 3. The reward r_h(s, a) for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel P_h(·|s, a) from an S-dimensional simplex independently and uniformly at random. Such procedure guarantees that the synthetic environment is a proper tabular MDP.' |
| Dataset Splits | No | The paper mentions running '10 trials' and plotting 'the average expected return per episode' but does not specify train/validation/test splits for any dataset. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions 'synthetic experiments'. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers for reproducibility. |
| Experiment Setup | Yes | We adopt the Boltzmann policy (Thrun, 1992) as the baseline policy in our algorithms. Under the Boltzmann policy, actions are taken randomly according to π_h(a|s) = exp{η Q_h(s,a)} / Σ_{a'∈A} exp{η Q_h(s,a')}, where a larger η leads to a more deterministic policy and higher expected value. ... In Figure 1, we track the expected return obtained in each episode with different baseline parameter η and conservative constraint γ. ... We use the baseline Boltzmann policy with η = 10 and η = 15 to collect the offline dataset. The numbers of offline trajectories are set to be 5000 and 8000, respectively. The conservative constraint γ is set to be 2.2. ... We set P_h = P and r_h = r for any h ∈ [H], and randomly generate P and r as in Section 7.1. ... We adopt the Boltzmann policy from Section 7.1 as the baseline policy and set η to be 5. (A minimal code sketch of this setup appears below the table.) |
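
For concreteness, the quoted setup can be reconstructed approximately as follows. This is a minimal sketch assuming only what the table reports (the authors do not release code): the variable names, the use of backward induction to obtain the Q-values that parameterize the Boltzmann baseline, the uniform initial-state distribution, and the offline rollout helper are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

# Minimal sketch of the synthetic tabular MDP and Boltzmann baseline policy
# described in the paper's experiment setup (Section 7.1). All names and the
# backward-induction step are assumptions; the authors' code is not released.

rng = np.random.default_rng(0)
S, A, H = 5, 5, 3  # number of states, actions per state, episode length

# Rewards r_h(s, a) drawn i.i.d. uniformly from [0, 1].
r = rng.uniform(0.0, 1.0, size=(H, S, A))

# Transition kernels P_h(.|s, a) drawn uniformly from the S-dimensional simplex
# (a Dirichlet(1, ..., 1) sample is uniform on the simplex).
P = rng.dirichlet(np.ones(S), size=(H, S, A))  # shape (H, S, A, S)

def optimal_q(P, r):
    """Backward induction for the Q-values of the finite-horizon tabular MDP."""
    H, S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P[h] @ V_next  # (S, A) + (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h].max(axis=1)
    return Q

def boltzmann_policy(Q, eta):
    """pi_h(a|s) proportional to exp(eta * Q_h(s, a)); larger eta is more greedy."""
    logits = eta * Q - (eta * Q).max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # shape (H, S, A)

def collect_trajectories(P, r, pi, n_episodes, rng):
    """Roll out the baseline policy to build an offline dataset of (h, s, a, r, s')."""
    H, S, A, _ = P.shape
    dataset = []
    for _ in range(n_episodes):
        s = rng.integers(S)  # assumed uniform initial-state distribution
        for h in range(H):
            a = rng.choice(A, p=pi[h, s])
            s_next = rng.choice(S, p=P[h, s, a])
            dataset.append((h, s, a, r[h, s, a], s_next))
            s = s_next
    return dataset

Q = optimal_q(P, r)
pi_baseline = boltzmann_policy(Q, eta=10.0)  # e.g., eta = 10 as in the quoted setup
offline_data = collect_trajectories(P, r, pi_baseline, n_episodes=5000, rng=rng)
```

The Dirichlet(1, ..., 1) draw is one standard way to sample transition distributions uniformly from the S-dimensional simplex, matching the quoted description; the conservative exploration algorithms themselves (StepMix/EpsMix) are not reproduced in this sketch.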