Sample Efficient Actor-Critic with Experience Replay
Authors: Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. We use the Arcade Learning Environment of Bellemare et al. (2013) to conduct an extensive evaluation. |
| Researcher Affiliation | Collaboration | Ziyu Wang (DeepMind) ziyu@google.com; Victor Bapst (DeepMind) vbapst@google.com; Nicolas Heess (DeepMind) heess@google.com; Volodymyr Mnih (DeepMind) vmnih@google.com; Remi Munos (DeepMind) munos@google.com; Koray Kavukcuoglu (DeepMind) korayk@google.com; Nando de Freitas (DeepMind, CIFAR, Oxford University) nandodefreitas@google.com |
| Pseudocode | Yes | The ACER algorithm results from a combination of the above ideas, with the precise pseudo-code appearing in Appendix A. Algorithm 1 ACER for discrete actions (master algorithm): // Assume global shared parameter vectors θ and θ_v. // Assume ratio of replay r. repeat: Call ACER on-policy, Algorithm 2; n ∼ Poisson(r); for i ∈ {1, …, n} do: Call ACER off-policy, Algorithm 2; end for; until max iteration or time reached. (A runnable sketch of this master loop appears after the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. It provides a link to videos of learned policies but not to the source code itself. |
| Open Datasets | No | The paper uses the Arcade Learning Environment and the MuJoCo physics engine, which are simulation environments rather than datasets with documented public access. It refers to the '57-game Atari domain' and '6 continuous control tasks', which are environments or tasks, but no specific dataset links or citations are provided for data files themselves. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology). It mentions evaluating on the '57-game Atari domain' and '6 continuous control tasks' without specifying training, validation, or test splits for data. |
| Hardware Specification | No | The paper mentions '16 actor-learner threads running on a single machine with no GPUs' but lacks specific hardware details such as exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Our experimental setup uses 16 actor-learner threads running on a single machine with no GPUs. We adopt the same input pre-processing and network architecture as Mnih et al. (2015). Specifically, the network consists of a convolutional layer with 32 8×8 filters with stride 4, followed by another convolutional layer with 64 4×4 filters with stride 2, followed by a final convolutional layer with 64 3×3 filters with stride 1, followed by a fully-connected layer of size 512. Each of the hidden layers is followed by a rectifier nonlinearity. The network outputs a softmax policy and Q values. When using replay, we add to each thread a replay memory that is up to 50,000 frames in size. For all Atari experiments, we use a single learning rate adopted from an earlier implementation of A3C without further tuning. We do not anneal the learning rates over the course of training as in Mnih et al. (2016). We otherwise adopt the same optimization procedure as in Mnih et al. (2016). Specifically, we adopt entropy regularization with weight 0.001, discount the rewards with γ = 0.99, and perform updates every 20 steps (k = 20 in the notation of Section 2). In all our experiments with experience replay, we use importance weight truncation with c = 10. When trust region updating is used, we use δ = 1 and α = 0.99 for all experiments. We use diagonal Gaussian policies with fixed diagonal covariances where the diagonal standard deviation is set to 0.3. For all setups, we sample the learning rates log-uniformly in the range [10^-4, 10^-3.3]. For setups involving trust region updating, we also sample δ uniformly in the range [0.1, 2]. With all setups, we use 30 sampled hyper-parameter settings. (A sketch of the Atari network architecture appears after the table.) |
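The master loop quoted in the Pseudocode row is simple enough to sketch directly. Below is a minimal Python rendering of Algorithm 1 as quoted above; the callback name `acer_step`, the default replay ratio, and the iteration cap are illustrative assumptions, and the actual on-/off-policy ACER update (Algorithm 2) is left abstract.

```python
import numpy as np

def acer_master(acer_step, replay_ratio=4, max_iterations=100_000, rng=None):
    """Algorithm 1 (master loop) as quoted above: one on-policy ACER call,
    then a Poisson-distributed number of off-policy (replay) calls."""
    rng = rng or np.random.default_rng()
    for _ in range(max_iterations):
        acer_step(on_policy=True)        # Call ACER on-policy (Algorithm 2)
        n = rng.poisson(replay_ratio)    # n ~ Poisson(r)
        for _ in range(n):
            acer_step(on_policy=False)   # Call ACER off-policy (Algorithm 2)
```

Because the number of replay calls is Poisson with mean r, each on-policy call is followed by r off-policy (replay) calls on average.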
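For concreteness, here is a minimal sketch of the Atari network described in the Experiment Setup row: three convolutional layers (32 8×8 stride 4, 64 4×4 stride 2, 64 3×3 stride 1), a 512-unit fully-connected layer, and softmax-policy and Q-value heads. The sketch is written in PyTorch (not necessarily the authors' framework) and assumes 84×84 inputs with a 4-frame stack as in Mnih et al. (2015); the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class AtariActorCritic(nn.Module):
    """Conv net from the setup description: 3 conv layers + FC(512),
    with a softmax policy head and a Q-value head."""
    def __init__(self, num_actions, in_channels=4):  # 4-frame stack is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes 84x84 inputs
        )
        self.policy_logits = nn.Linear(512, num_actions)  # softmax policy head
        self.q_values = nn.Linear(512, num_actions)       # Q-value head

    def forward(self, x):
        h = self.features(x)
        return torch.softmax(self.policy_logits(h), dim=-1), self.q_values(h)
```

For example, `AtariActorCritic(num_actions=18)(torch.zeros(1, 4, 84, 84))` returns a (policy, Q-values) pair, each of shape (1, 18).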