Constrained Reinforcement Learning Under Model Mismatch
Authors: Zhongchang Sun, Sihong He, Fei Miao, Shaofeng Zou
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the proposed algorithm, we compare it with several baseline algorithms (PCPO (Yang et al., 2019), RVI (Iyengar, 2005), CPO (Achiam et al., 2017), R3C (Mankowitz et al., 2020) and CUP (Yang et al., 2022)) in the setting of tabular and deep cases, while using different environments such as the gambler problem (Sutton & Barto, 2018; Zhou et al., 2021; Shi & Chi, 2022), the N-chain problem (Wang et al., 2022), the Frozen-Lake problem (Brockman et al., 2016) and the Point Gather in Mujoco (Achiam et al., 2017; Yang et al., 2019). ... For each problem, we run the algorithms for 5 independent times and plot the mean of the reward and utility along with their standard deviation as a function of the number of iterations. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, University at Buffalo, New York, USA 2School of Computing, University of Connecticut, Storrs, USA 3Department of Computer Science & Engineering, University at Buffalo, New York, USA. |
| Pseudocode | Yes | Algorithm 1 Robust Constrained Policy Optimization |
| Open Source Code | No | The paper states that the experiments are implemented in 'rllab', a third-party toolkit, but provides no link to, or statement about releasing, the code for its own RCPO algorithm. |
| Open Datasets | Yes | We compare it with several baseline algorithms (...) using different environments such as the gambler problem (Sutton & Barto, 2018; Zhou et al., 2021; Shi & Chi, 2022), the N-chain problem (Wang et al., 2022), the Frozen-Lake problem (Brockman et al., 2016) and the Point Gather in Mujoco (Achiam et al., 2017; Yang et al., 2019). |
| Dataset Splits | No | The paper mentions environments and a batch size for training, but does not specify how the data from these environments is split into training, validation, and test sets. It provides no split percentages or sample counts, nor does it reference standard splits in sufficient detail for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'rllab (Duan et al., 2016)' but does not provide specific version numbers for rllab or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | We use the following hyper-parameters for training RCPO: discounted factor = 0.995, learning step size = 0.001, batch size = 50,000, and utility-constrained threshold = 0.1. To provide fair comparisons, we use the same hyper-parameters for training baseline algorithms. |
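The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration object, which is often the quickest way to check that a reimplementation matches the reported settings. The sketch below is a hypothetical reconstruction, not the authors' code (which is not released): the numeric values are taken verbatim from the paper, while every name (`RCPOConfig`, `run_experiment`, the environment labels) and the placeholder loop are assumptions.

```python
# Hypothetical configuration sketch for the training setup described in the
# "Experiment Setup" row above. Values are quoted from the paper; all names
# and the placeholder loop are assumptions, since no official code is released.

from dataclasses import dataclass


@dataclass
class RCPOConfig:
    """Hyper-parameters reported in the paper's experiment setup."""
    discount_factor: float = 0.995   # "discounted factor = 0.995"
    step_size: float = 0.001         # "learning step size = 0.001"
    batch_size: int = 50_000         # samples collected per policy update
    utility_threshold: float = 0.1   # "utility-constrained threshold = 0.1"
    num_seeds: int = 5               # 5 independent runs per problem


def run_experiment(config: RCPOConfig, env_name: str) -> None:
    """Placeholder training driver; the actual update rule is Algorithm 1
    (Robust Constrained Policy Optimization) in the paper."""
    for seed in range(config.num_seeds):
        # The paper reuses the same hyper-parameters for all baselines
        # (PCPO, RVI, CPO, R3C, CUP) to keep comparisons fair.
        print(f"[{env_name}] seed={seed} "
              f"gamma={config.discount_factor} lr={config.step_size} "
              f"batch={config.batch_size} threshold={config.utility_threshold}")


if __name__ == "__main__":
    # Environment labels below are shorthand for the four test problems
    # named in the paper, not registered environment IDs.
    for env in ("gambler", "n-chain", "frozen-lake", "point-gather"):
        run_experiment(RCPOConfig(), env)
```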