Off-Policy Reinforcement Learning with Delayed Rewards

Authors: Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants."
Researcher Affiliation | Collaboration | (1) Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; (2) Department of Computer Science, University of Illinois Urbana-Champaign, Illinois, United States; (3) Helixon Limited, Beijing, China; (4) Yau Mathematical Sciences Center, Tsinghua University, Beijing, China; (5) Institute for Industry AI Research, Tsinghua University, Beijing, China. Correspondence to: Jian Peng <jianpeng@illinois.edu>.
Pseudocode | Yes | Algorithm 1: Algorithm with General Reward Function
Open Source Code | No | The paper does not explicitly state that its code is open-source, nor does it provide a link to a repository for its specific methodology.
Open Datasets | Yes | "All high-dimensional experiments are based on OpenAI Gym (Brockman et al., 2016) with MuJoCo200." (An environment-setup sketch appears after this table.)
Dataset Splits | No | The paper mentions training on mini-batches and using a replay buffer, but it does not specify explicit training/validation/test dataset splits with percentages or counts.
Hardware Specification | Yes | "All experiments are trained on GeForce GTX 1080 Ti and Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz."
Software Dependencies | No | The paper mentions using "OpenAI Baselines" and the "Adam" optimizer, but it does not provide specific version numbers for these or for other software libraries (e.g., Python, PyTorch, CUDA) used in the implementation.
Experiment Setup | Yes | "The corresponding hyperparameters are shown in Table 2. We use a smaller batch size due to limited computation power." Table 2 (shared hyperparameters with SAC): Qϕ, πθ architecture: 2 hidden-layer MLPs with 256 units each; non-linearity: ReLU; batch size: 128; discount factor γ: 0.99; optimizer: Adam (Kingma & Ba, 2014); learning rate: 3·10⁻⁴; entropy target: −|A|; target smoothing τ: 0.005; replay buffer: large enough for 1M samples; target update interval: 1; gradient steps: 1. (A hedged configuration sketch based on these values appears below.)
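The paper's high-dimensional experiments run on OpenAI Gym MuJoCo tasks. Below is a minimal sketch of how such an environment is typically set up, assuming the classic (pre-0.26) Gym API that matches the MuJoCo200 era; the task ID "HalfCheetah-v2" and the random-action rollout are illustrative assumptions, not a claim about the paper's exact benchmark list.

```python
# Minimal sketch: creating and rolling out a Gym MuJoCo environment.
# Assumes gym with mujoco-py (MuJoCo 2.0.0) installed; task ID is an example.
import gym

env = gym.make("HalfCheetah-v2")

obs = env.reset()            # classic Gym API: reset() returns the observation
done = False
episode_return = 0.0
while not done:
    action = env.action_space.sample()            # placeholder random policy
    obs, reward, done, info = env.step(action)    # classic 4-tuple step API
    episode_return += reward
print("episode return:", episode_return)
```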
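The following sketch collects the shared SAC hyperparameters from Table 2 into a plain config and builds the 2x256 ReLU MLP architecture used for the Q and policy networks. PyTorch, the `mlp` helper, and the example input/output sizes are assumptions for illustration; the paper does not state its framework, and this is not the authors' implementation.

```python
# Hedged sketch of the Table 2 hyperparameters (shared with SAC).
# PyTorch and the helper names below are assumptions, not the authors' code.
import torch
import torch.nn as nn

SAC_HPARAMS = dict(
    hidden_sizes=(256, 256),      # 2 hidden-layer MLPs with 256 units each
    batch_size=128,
    discount=0.99,                # gamma
    lr=3e-4,                      # Adam learning rate
    target_smoothing_tau=0.005,
    replay_buffer_size=1_000_000, # "large enough for 1M samples"
    target_update_interval=1,
    gradient_steps=1,
)
# The entropy target is -|A|, i.e. minus the action dimensionality of the task.


def mlp(in_dim: int, out_dim: int, hidden_sizes=(256, 256)) -> nn.Sequential:
    """2 hidden-layer ReLU MLP matching the Q/policy architecture in Table 2."""
    layers, last = [], in_dim
    for h in hidden_sizes:
        layers += [nn.Linear(last, h), nn.ReLU()]
        last = h
    layers.append(nn.Linear(last, out_dim))
    return nn.Sequential(*layers)


# Example: a Q-network for a hypothetical 17-dim observation / 6-dim action task.
q_net = mlp(17 + 6, 1, SAC_HPARAMS["hidden_sizes"])
optimizer = torch.optim.Adam(q_net.parameters(), lr=SAC_HPARAMS["lr"])
```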