Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints
Authors: Donghao Li, Ruiquan Huang, Cong Shen, Jing Yang
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies. |
| Researcher Affiliation | Academia | ¹School of EECS, The Pennsylvania State University, University Park, PA, USA; ²ECE Department, University of Virginia, Charlottesville, VA, USA. Correspondence to: Jing Yang <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 The Step Mix Algorithm; Algorithm 2 The Step Mix Algorithm; Algorithm 3 Policy Eva Subroutine |
| Open Source Code | No | The paper does not provide a statement about open-sourcing the code or a link to a code repository. |
| Open Datasets | No | The paper describes generating a 'synthetic environment' for evaluation, rather than using a publicly available dataset with concrete access information. 'We generate a synthetic environment to evaluate the proposed algorithms. We set the number of states S to be 5, the number of actions A for each state to be 5, and the episode length H to be 3. The reward $r_h(s, a)$ for each state-action pair and each step is generated independently and uniformly at random from [0, 1]. We also generate the transition kernel $P_h(\cdot \mid s, a)$ from an S-dimensional simplex independently and uniformly at random. Such procedure guarantees that the synthetic environment is a proper tabular MDP.' |
| Dataset Splits | No | The paper mentions running '10 trials' and plotting 'the average expected return per episode' but does not specify train/validation/test splits for any dataset. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory) used for running the experiments. It only mentions 'synthetic experiments'. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers for reproducibility. |
| Experiment Setup | Yes | We adopt the Boltzmann policy (Thrun, 1992) as the baseline policy in our algorithms. Under the Boltzmann policy, actions are taken randomly according to $\pi_h(a \mid s) = \frac{\exp\{\eta Q_h^\star(s,a)\}}{\sum_{a' \in \mathcal{A}} \exp\{\eta Q_h^\star(s,a')\}}$, where a larger η leads to a more deterministic policy and higher expected value. ... In Figure 1, we track the expected return obtained in each episode with different baseline parameter η and conservative constraint γ. ... We use the baseline Boltzmann policy with η = 10 and η = 15 to collect the offline dataset. The numbers of offline trajectories are set to be 5000 and 8000, respectively. The conservative constraint γ is set to be 2.2. ... We set $P_h = P$ and $r_h = r$ for any $h \in [H]$, and randomly generate $P$ and $r$ as in Section 7.1. ... We adopt the Boltzmann policy from Section 7.1 as the baseline policy and set η to be 5. |
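
Since the paper releases no code, the synthetic setup quoted in the Open Datasets and Experiment Setup rows can only be approximated. The following is a minimal sketch of that setup, not the authors' implementation. It assumes that "uniformly at random from an S-dimensional simplex" corresponds to a Dirichlet(1, ..., 1) draw, and that the Q in the Boltzmann formula is the optimal action-value function obtained by backward induction. All function names (`make_random_mdp`, `optimal_q`, `boltzmann_policy`, `policy_value`) and the seed are hypothetical.

```python
import numpy as np


def make_random_mdp(S=5, A=5, H=3, seed=0):
    """Draw a random tabular MDP as described in the quoted setup."""
    rng = np.random.default_rng(seed)
    # Rewards r_h(s, a) drawn i.i.d. uniformly from [0, 1].
    r = rng.uniform(0.0, 1.0, size=(H, S, A))
    # Assumption: a uniform draw from the simplex is Dirichlet(1, ..., 1).
    # Resulting shape is (H, S, A, S); the last axis sums to 1.
    P = rng.dirichlet(np.ones(S), size=(H, S, A))
    return r, P


def optimal_q(r, P):
    """Backward induction for the optimal Q (assumed to be the Q in the formula)."""
    H, S, A = r.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)  # value after the final step is zero
    for h in reversed(range(H)):
        Q[h] = r[h] + P[h] @ V_next  # (S, A, S) @ (S,) -> (S, A)
        V_next = Q[h].max(axis=1)
    return Q


def boltzmann_policy(Q, eta):
    """pi_h(a | s) proportional to exp(eta * Q_h(s, a)), computed stably."""
    logits = eta * Q
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


def policy_value(r, P, pi):
    """Expected return of policy pi from each initial state."""
    H, S, A = r.shape
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q_pi = r[h] + P[h] @ V_next
        V_next = (pi[h] * Q_pi).sum(axis=1)
    return V_next


if __name__ == "__main__":
    r, P = make_random_mdp()
    Q = optimal_q(r, P)
    for eta in (5.0, 10.0, 15.0):  # the eta values mentioned in the table
        pi = boltzmann_policy(Q, eta)
        print(eta, policy_value(r, P, pi).mean())
```

Running the script prints the average baseline value for each η listed in the table, which gives a rough check of the quoted claim that a larger η yields a more deterministic policy with higher expected value. The stationary variant described in the last part of the setup (the same P and r at every step) could be obtained by drawing one step and repeating it along the first axis.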