Constrained Reinforcement Learning Under Model Mismatch
Authors: Zhongchang Sun, Sihong He, Fei Miao, Shaofeng Zou
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the proposed algorithm, we compare it with several baseline algorithms (PCPO (Yang et al., 2019), RVI (Iyengar, 2005), CPO (Achiam et al., 2017), R3C (Mankowitz et al., 2020) and CUP (Yang et al., 2022)) in the setting of tabular and deep cases, while using different environments such as the gambler problem (Sutton & Barto, 2018; Zhou et al., 2021; Shi & Chi, 2022), the N-chain problem (Wang et al., 2022), the Frozen-Lake problem (Brockman et al., 2016) and the Point Gather in Mujoco (Achiam et al., 2017; Yang et al., 2019). ... For each problem, we run the algorithms for 5 independent times and plot the mean of the reward and utility along with their standard deviation as a function of the number of iterations. |
| Researcher Affiliation | Academia | 1Department of Electrical Engineering, University at Buffalo, New York, USA 2School of Computing, University of Connecticut, Storrs, USA 3Department of Computer Science & Engineering, University at Buffalo, New York, USA. |
| Pseudocode | Yes | Algorithm 1 Robust Constrained Policy Optimization |
| Open Source Code | No | The paper states that the experiments are implemented in 'rllab', a third-party toolkit, but provides no link to, or statement about releasing, the code for its own RCPO algorithm. |
| Open Datasets | Yes | We compare it with several baseline algorithms (...) using different environments such as the gambler problem (Sutton & Barto, 2018; Zhou et al., 2021; Shi & Chi, 2022), the N-chain problem (Wang et al., 2022), the Frozen-Lake problem (Brockman et al., 2016) and the Point Gather in Mujoco (Achiam et al., 2017; Yang et al., 2019). |
| Dataset Splits | No | The paper mentions environments and a batch size for training, but does not specify how the data from these environments is split into training, validation, and test sets. It provides no split percentages or sample counts, nor does it reference standard splits in sufficient detail for reproduction. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'rllab (Duan et al., 2016)' but does not provide specific version numbers for rllab or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | We use the following hyper-parameters for training RCPO: discounted factor = 0.995, learning step size = 0.001, batch size = 50,000, and utility-constrained threshold = 0.1. To provide fair comparisons, we use the same hyper-parameters for training baseline algorithms. |
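The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration object, which is often the quickest way to check that a reimplementation matches the reported settings. The sketch below is a hypothetical reconstruction, not the authors' code (which is not released): the numeric values are taken verbatim from the paper, while every name (`RCPOConfig`, `run_experiment`, the environment labels) and the placeholder loop are assumptions.

```python
# Hypothetical configuration sketch for the training setup described in the
# "Experiment Setup" row above. Values are quoted from the paper; all names
# and the placeholder loop are assumptions, since no official code is released.

from dataclasses import dataclass


@dataclass
class RCPOConfig:
    """Hyper-parameters reported in the paper's experiment setup."""
    discount_factor: float = 0.995   # "discounted factor = 0.995"
    step_size: float = 0.001         # "learning step size = 0.001"
    batch_size: int = 50_000         # samples collected per policy update
    utility_threshold: float = 0.1   # "utility-constrained threshold = 0.1"
    num_seeds: int = 5               # 5 independent runs per problem


def run_experiment(config: RCPOConfig, env_name: str) -> None:
    """Placeholder training driver; the actual update rule is Algorithm 1
    (Robust Constrained Policy Optimization) in the paper."""
    for seed in range(config.num_seeds):
        # The paper reuses the same hyper-parameters for all baselines
        # (PCPO, RVI, CPO, R3C, CUP) to keep comparisons fair.
        print(f"[{env_name}] seed={seed} "
              f"gamma={config.discount_factor} lr={config.step_size} "
              f"batch={config.batch_size} threshold={config.utility_threshold}")


if __name__ == "__main__":
    # Environment labels below are shorthand for the four test problems
    # named in the paper, not registered environment IDs.
    for env in ("gambler", "n-chain", "frozen-lake", "point-gather"):
        run_experiment(RCPOConfig(), env)
```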