Experience Replay Optimization

Authors: Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, Xia Hu

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments on a range of continuous control tasks demonstrate the effectiveness of ERO, empirically showing that learning the experience replay strategy can improve the performance of off-policy reinforcement learning algorithms.
Researcher Affiliation | Academia | Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou and Xia Hu, Department of Computer Science and Engineering, Texas A&M University. {daochen.zha, khlai037, zkxiong, xiahu}@tamu.edu
Pseudocode | Yes | Algorithm 1 (ERO-enhanced DDPG) and Algorithm 2 (Update Replay Policy) are provided. A hedged sketch of the replay-policy update appears after this table.
Open Source Code | No | The paper states, 'Our implementations are based on OpenAI DDPG baseline,' and the accompanying footnote links to https://github.com/openai/baselines. However, this refers to a third-party baseline they built on, not their own open-sourced code for the method described in this paper.
Open Datasets | Yes | 'Our experiments are conducted on the following continuous control tasks from OpenAI Gym: HalfCheetah-v2, InvertedDoublePendulum-v2, Hopper-v2, InvertedPendulum-v2, HumanoidStandup-v2, Reacher-v2, Humanoid-v2, Pendulum-v0 [Todorov et al., 2012; Brockman et al., 2016].' The snippet after this table shows how these Gym environments can be instantiated.
Dataset Splits | No | The paper does not explicitly provide training, validation, or test splits. It only lists the environments used and notes that 'Each task is run for 5 times with 2 x 10^6 timesteps using different random seeds.'
Hardware Specification | Yes | 'Our experiments are performed on a server with 24 Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.2GHz processors and 4 GeForce GTX-1080 Ti 12 GB GPUs.'
Software Dependencies | No | The paper mentions using the 'OpenAI DDPG baseline' and the 'Adam optimizer' but does not specify exact version numbers for these or any other software components.
Experiment Setup | Yes | 'Specifically, τ = 0.001 is used for soft target updates, learning rates of 10^-4 and 10^-3 are adopted for the actor and critic respectively, Ornstein-Uhlenbeck noise with θ = 0.15 and σ = 0.2 is used for exploration, the mini-batch size is 64, the replay buffer size is 10^6, the number of rollout steps is 100, and the number of training steps is 50. For our ERO ... the number of replay updating steps is set to 1 with mini-batch size 64. Adam optimizer is used with a learning rate of 10^-4.' These settings are gathered into a configuration sketch after this table.
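
Referring back to the Pseudocode row: the paper's Algorithm 2 learns a replay policy that scores stored transitions. Below is a minimal sketch, assuming the replay policy is a small network mapping per-transition features to a keep-probability, drawing a Bernoulli mask over the buffer, and being updated with a REINFORCE-style gradient weighted by a scalar "replay reward" (the improvement in the agent's cumulative reward). The feature choice, network size, and PyTorch framing are illustrative assumptions, not the authors' exact implementation.

# Illustrative sketch of a learned replay policy in the spirit of Algorithm 2
# (Update Replay Policy). Assumes per-transition features, a Bernoulli mask
# over the buffer, and a REINFORCE-style update weighted by the improvement
# in cumulative reward ("replay reward"). Not the authors' released code.
import torch
import torch.nn as nn

class ReplayPolicy(nn.Module):
    """Scores each stored transition with a probability of being kept for replay."""
    def __init__(self, feature_dim, hidden=64, lr=1e-4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )
        # Adam with lr = 10^-4, matching the setting reported for the replay policy.
        self.optim = torch.optim.Adam(self.parameters(), lr=lr)

    def sample_mask(self, features):
        """Draw a Bernoulli keep/drop mask over the buffer; the agent's mini-batches
        are then sampled uniformly from the kept subset."""
        with torch.no_grad():
            probs = self.net(features).squeeze(-1)
        return torch.bernoulli(probs)

    def update(self, features, mask, replay_reward):
        """REINFORCE-style step: log-likelihood of the sampled mask, weighted by the
        scalar replay reward (improvement in cumulative reward since the last update)."""
        probs = self.net(features).squeeze(-1)
        log_lik = mask * torch.log(probs + 1e-8) + (1 - mask) * torch.log(1 - probs + 1e-8)
        loss = -(replay_reward * log_lik).mean()
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()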
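
For the Open Datasets row: the listed tasks are standard OpenAI Gym environment IDs. The short check below assumes a Gym/mujoco-py installation contemporary with the paper (the *-v2 MuJoCo tasks are not exposed under these IDs in recent Gymnasium releases).

# Instantiate the continuous control tasks listed above. The MuJoCo-based
# *-v2 environments additionally require a working mujoco-py installation.
import gym

task_ids = [
    "HalfCheetah-v2", "InvertedDoublePendulum-v2", "Hopper-v2",
    "InvertedPendulum-v2", "HumanoidStandup-v2", "Reacher-v2",
    "Humanoid-v2", "Pendulum-v0",
]

for task_id in task_ids:
    env = gym.make(task_id)
    env.reset()
    print(task_id, env.observation_space.shape, env.action_space.shape)
    env.close()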
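
For the Experiment Setup row: the reported hyperparameters can be collected into a single configuration, and the exploration noise follows a standard Ornstein-Uhlenbeck process with the stated θ and σ. The mean mu = 0 and step size dt used below are illustrative defaults, not values given in the paper.

# Reported DDPG/ERO settings gathered in one place, plus a standard
# Ornstein-Uhlenbeck exploration-noise process with the stated theta/sigma.
# mu and dt are illustrative defaults, not values given in the paper.
import numpy as np

CONFIG = {
    "tau": 1e-3,                # soft target update rate
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "ou_theta": 0.15,
    "ou_sigma": 0.2,
    "batch_size": 64,
    "buffer_size": int(1e6),
    "rollout_steps": 100,
    "train_steps": 50,
    "replay_update_steps": 1,   # ERO replay-policy updates per cycle
    "replay_lr": 1e-4,          # Adam learning rate for the replay policy
}

class OrnsteinUhlenbeckNoise:
    """Temporally correlated exploration noise added to DDPG actions."""
    def __init__(self, action_dim, theta=0.15, sigma=0.2, mu=0.0, dt=1e-2):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.x = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        self.x[:] = self.mu

    def sample(self):
        # Euler-Maruyama step: x <- x + theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, I)
        self.x += self.theta * (self.mu - self.x) * self.dt \
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        return self.x.copy()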