Sample Efficient Actor-Critic with Experience Replay
Authors: Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, Nando de Freitas
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents an actor-critic deep reinforcement learning agent with experience replay that is stable, sample efficient, and performs remarkably well on challenging environments, including the discrete 57-game Atari domain and several continuous control problems. We use the Arcade Learning Environment of Bellemare et al. (2013) to conduct an extensive evaluation. |
| Researcher Affiliation | Collaboration | Ziyu Wang (DeepMind) ziyu@google.com; Victor Bapst (DeepMind) vbapst@google.com; Nicolas Heess (DeepMind) heess@google.com; Volodymyr Mnih (DeepMind) vmnih@google.com; Remi Munos (DeepMind) munos@google.com; Koray Kavukcuoglu (DeepMind) korayk@google.com; Nando de Freitas (DeepMind, CIFAR, Oxford University) nandodefreitas@google.com |
| Pseudocode | Yes | The ACER algorithm results from a combination of the above ideas, with the precise pseudo-code appearing in Appendix A. Algorithm 1 ACER for discrete actions (master algorithm): // Assume global shared parameter vectors θ and θ_v. // Assume ratio of replay r. repeat: Call ACER on-policy, Algorithm 2; n ∼ Poisson(r); for i ∈ {1, …, n} do: Call ACER off-policy, Algorithm 2; end for; until max iteration or time reached. (A runnable sketch of this master loop appears after the table.) |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. It provides a link to videos of learned policies but not to the source code itself. |
| Open Datasets | No | The paper uses the Arcade Learning Environment and the MuJoCo physics engine, which are simulation environments rather than datasets with documented public access. It refers to the '57-game Atari domain' and '6 continuous control tasks', which are environments or tasks, but no specific dataset links or citations are provided for data files themselves. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology). It mentions evaluating on the '57-game Atari domain' and '6 continuous control tasks' without specifying training, validation, or test splits for data. |
| Hardware Specification | No | The paper mentions '16 actor-learner threads running on a single machine with no GPUs' but lacks specific hardware details such as exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Our experimental setup uses 16 actor-learner threads running on a single machine with no GPUs. We adopt the same input pre-processing and network architecture as Mnih et al. (2015). Specifically, the network consists of a convolutional layer with 32 8×8 filters with stride 4, followed by another convolutional layer with 64 4×4 filters with stride 2, followed by a final convolutional layer with 64 3×3 filters with stride 1, followed by a fully-connected layer of size 512. Each of the hidden layers is followed by a rectifier nonlinearity. The network outputs a softmax policy and Q values. When using replay, we add to each thread a replay memory that is up to 50,000 frames in size. For all Atari experiments, we use a single learning rate adopted from an earlier implementation of A3C without further tuning. We do not anneal the learning rates over the course of training as in Mnih et al. (2016). We otherwise adopt the same optimization procedure as in Mnih et al. (2016). Specifically, we adopt entropy regularization with weight 0.001, discount the rewards with γ = 0.99, and perform updates every 20 steps (k = 20 in the notation of Section 2). In all our experiments with experience replay, we use importance weight truncation with c = 10. When trust region updating is used, we use δ = 1 and α = 0.99 for all experiments. We use diagonal Gaussian policies with fixed diagonal covariances where the diagonal standard deviation is set to 0.3. For all setups, we sample the learning rates log-uniformly in the range [10^-4, 10^-3.3]. For setups involving trust region updating, we also sample δ uniformly in the range [0.1, 2]. With all setups, we use 30 sampled hyper-parameter settings. (A sketch of the Atari network architecture appears after the table.) |
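The master loop quoted in the Pseudocode row is simple enough to sketch directly. Below is a minimal Python rendering of Algorithm 1 as quoted above; the callback name `acer_step`, the default replay ratio, and the iteration cap are illustrative assumptions, and the actual on-/off-policy ACER update (Algorithm 2) is left abstract.

```python
import numpy as np

def acer_master(acer_step, replay_ratio=4, max_iterations=100_000, rng=None):
    """Algorithm 1 (master loop) as quoted above: one on-policy ACER call,
    then a Poisson-distributed number of off-policy (replay) calls."""
    rng = rng or np.random.default_rng()
    for _ in range(max_iterations):
        acer_step(on_policy=True)        # Call ACER on-policy (Algorithm 2)
        n = rng.poisson(replay_ratio)    # n ~ Poisson(r)
        for _ in range(n):
            acer_step(on_policy=False)   # Call ACER off-policy (Algorithm 2)
```

Because the number of replay calls is Poisson with mean r, each on-policy call is followed by r off-policy (replay) calls on average.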
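For concreteness, here is a minimal sketch of the Atari network described in the Experiment Setup row: three convolutional layers (32 8×8 stride 4, 64 4×4 stride 2, 64 3×3 stride 1), a 512-unit fully-connected layer, and softmax-policy and Q-value heads. The sketch is written in PyTorch (not necessarily the authors' framework) and assumes 84×84 inputs with a 4-frame stack as in Mnih et al. (2015); the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class AtariActorCritic(nn.Module):
    """Conv net from the setup description: 3 conv layers + FC(512),
    with a softmax policy head and a Q-value head."""
    def __init__(self, num_actions, in_channels=4):  # 4-frame stack is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes 84x84 inputs
        )
        self.policy_logits = nn.Linear(512, num_actions)  # softmax policy head
        self.q_values = nn.Linear(512, num_actions)       # Q-value head

    def forward(self, x):
        h = self.features(x)
        return torch.softmax(self.policy_logits(h), dim=-1), self.q_values(h)
```

For example, `AtariActorCritic(num_actions=18)(torch.zeros(1, 4, 84, 84))` returns a (policy, Q-values) pair, each of shape (1, 18).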