Regret Minimization Experience Replay in Off-Policy Reinforcement Learning
Authors: Xu-Hui Liu, Zhenghai Xue, Jingcheng Pang, Shengyi Jiang, Feng Xu, Yang Yu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to evaluate ReMERN and ReMERT. ... Both methods outperform previous prioritized sampling algorithms in challenging RL benchmarks, including MuJoCo, Atari and Meta-World. |
| Researcher Affiliation | Academia | Xu-Hui Liu, Zhenghai Xue, Jing-Cheng Pang, Shengyi Jiang, Feng Xu, Yang Yu. National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China. liuxh@lamda.nju.edu.cn, xuezh@smail.nju.edu.cn, {pangjc, jiangsy, xufeng}@lamda.nju.edu.cn, yuy@nju.edu.cn |
| Pseudocode | Yes | The pseudo code for ReMERN is presented in Appendix C. ... Its pseudo code is presented in Appendix C. |
| Open Source Code | Yes | Codes are available at https://github.com/AIDefender/ReMERN-ReMERT. |
| Open Datasets | Yes | We first compare the performance of ReMERN and ReMERT to prior sampling methods in continuous control benchmarks including Meta-World [24], MuJoCo [25] and DeepMind Control Suite (DMC) [26]. We also evaluate our methods in Arcade Learning Environments with discrete action spaces. |
| Dataset Splits | No | The paper mentions training steps and seeds but does not explicitly describe dataset splits for training, validation, and testing. |
| Hardware Specification | Yes | All experiments are run on NVIDIA GeForce RTX 3090 GPUs. |
| Software Dependencies | Yes | All experiments are run on Ubuntu 20.04.2 LTS, CUDA 11.1, PyTorch 1.9.0. |
| Experiment Setup | Yes | All models are trained for 3M steps. Each task uses 4 seeds. ... The learning rate is set to 3e-4. We use Adam optimizer... The batch size is 256. The replay buffer size is 1M. The training starts after collecting 2000 transitions. The update ratio is 1. For PER, the α is set to 0.6. For DQN, we use the same architecture and hyperparameters as [5] (Nature DQN). (These settings are collected in the sketches after this table.) |
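
For readers reconstructing the setup, the hyperparameters reported in the Experiment Setup row can be gathered into a single configuration object. This is a minimal sketch; the dictionary keys below are illustrative names chosen here, not identifiers from the authors' codebase.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative, not taken from the authors' code.
EXPERIMENT_CONFIG = {
    "total_train_steps": 3_000_000,   # all models trained for 3M steps
    "seeds_per_task": 4,              # each task uses 4 random seeds
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "batch_size": 256,
    "replay_buffer_size": 1_000_000,  # 1M transitions
    "learning_starts": 2_000,         # training begins after 2000 transitions
    "update_ratio": 1,                # gradient updates per environment step
    "per_alpha": 0.6,                 # priority exponent for the PER baseline
}
```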
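Since the only PER hyperparameter reported is α = 0.6, a short sketch of what that exponent controls may help. This follows the standard proportional prioritization rule of Schaul et al. (P(i) ∝ p_i^α), not the paper's ReMERN/ReMERT weighting; the function name and the small numerical example are assumptions made here for illustration.

```python
import numpy as np

def per_sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritized sampling: P(i) = p_i^alpha / sum_k p_k^alpha,
    with p_i = |TD error_i| + eps. alpha = 0.6 matches the PER baseline
    setting reported in the paper; alpha = 0 recovers uniform sampling."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# Example: transitions with larger TD errors are sampled more often.
probs = per_sampling_probabilities(np.array([0.1, 1.0, 2.5]))
```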