Off-Policy Reinforcement Learning with Delayed Rewards

Authors: Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants."
Researcher Affiliation | Collaboration | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (2) Department of Computer Science, University of Illinois Urbana-Champaign, Illinois, United States; (3) Helixon Limited, Beijing, China; (4) Yau Mathematical Sciences Center, Tsinghua University, Beijing, China; (5) Institute for Industry AI Research, Tsinghua University, Beijing, China. Correspondence to: Jian Peng <jianpeng@illinois.edu>.
Pseudocode | Yes | Algorithm 1: Algorithm with General Reward Function
Open Source Code | No | The paper does not explicitly state that its code is open-source, nor does it provide a link to a repository for its specific methodology.
Open Datasets | Yes | "All high-dimensional experiments are based on OpenAI Gym (Brockman et al., 2016) with MuJoCo200." (An environment-setup sketch appears after this table.)
Dataset Splits | No | The paper mentions training on mini-batches and using a replay buffer, but it does not specify explicit training/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | "All experiments are trained on GeForce GTX 1080 Ti and Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz."
Software Dependencies | No | The paper mentions using "OpenAI Baselines" and the "Adam" optimizer, but it does not provide specific version numbers for these or for other software libraries (e.g., Python, PyTorch, CUDA) used in the implementation.
Experiment Setup | Yes | "The corresponding hyperparameters are shown in Table 2. We use a smaller batch size due to limited computation power." Table 2 (shared hyperparameters with SAC): Qϕ, πθ architecture: 2 hidden-layer MLPs with 256 units each; non-linearity: ReLU; batch size: 128; discount factor γ: 0.99; optimizer: Adam (Kingma & Ba, 2014); learning rate: 3·10⁻⁴; entropy target: −|A|; target smoothing τ: 0.005; replay buffer: large enough for 1M samples; target update interval: 1; gradient steps: 1. (A hedged configuration sketch based on these values appears below.)
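The paper's high-dimensional experiments run on OpenAI Gym MuJoCo tasks. Below is a minimal sketch of how such an environment is typically set up, assuming the classic (pre-0.26) Gym API that matches the MuJoCo200 era; the task ID "HalfCheetah-v2" and the random-action rollout are illustrative assumptions, not a claim about the paper's exact benchmark list.

```python
# Minimal sketch: creating and rolling out a Gym MuJoCo environment.
# Assumes gym with mujoco-py (MuJoCo 2.0.0) installed; task ID is an example.
import gym

env = gym.make("HalfCheetah-v2")

obs = env.reset()            # classic Gym API: reset() returns the observation
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()            # placeholder random policy
    obs, reward, done, info = env.step(action)    # classic 4-tuple step API
    episode_return += reward
print("episode return:", episode_return)
```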
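The following sketch collects the shared SAC hyperparameters from Table 2 into a plain config and builds the 2x256 ReLU MLP architecture used for the Q and policy networks. PyTorch, the `mlp` helper, and the example input/output sizes are assumptions for illustration; the paper does not state its framework, and this is not the authors' implementation.

```python
# Hedged sketch of the Table 2 hyperparameters (shared with SAC).
# PyTorch and the helper names below are assumptions, not the authors' code.
import torch
import torch.nn as nn

SAC_HPARAMS = dict(
    hidden_sizes=(256, 256),      # 2 hidden-layer MLPs with 256 units each
    batch_size=128,
    discount=0.99,                # gamma
    lr=3e-4,                      # Adam learning rate
    target_smoothing_tau=0.005,
    replay_buffer_size=1_000_000, # "large enough for 1M samples"
    target_update_interval=1,
    gradient_steps=1,
)
# The entropy target is -|A|, i.e. minus the action dimensionality of the task.


def mlp(in_dim: int, out_dim: int, hidden_sizes=(256, 256)) -> nn.Sequential:
    """2 hidden-layer ReLU MLP matching the Q/policy architecture in Table 2."""
    layers, last = [], in_dim
    for h in hidden_sizes:
        layers += [nn.Linear(last, h), nn.ReLU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)


# Example: a Q-network for a hypothetical 17-dim observation / 6-dim action task.
q_net = mlp(17 + 6, 1, SAC_HPARAMS["hidden_sizes"])
optimizer = torch.optim.Adam(q_net.parameters(), lr=SAC_HPARAMS["lr"])
```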