Action Branching Architectures for Deep Reinforcement Learning

Authors: Arash Tavakoli, Fabio Pardo, Petar Kormushev

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of our agent on a set of challenging continuous control tasks. The empirical results show that the proposed agent scales gracefully to environments with increasing action dimensionality and indicate the significance of the shared decision module in coordination of the distributed action branches.
Researcher Affiliation | Academia | Arash Tavakoli, Fabio Pardo, Petar Kormushev; Imperial College London, London SW7 2AZ, United Kingdom; {a.tavakoli, f.pardo, p.kormushev}@imperial.ac.uk
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper states: "We used the Open AI Baselines (Hesse et al. 2017) implementation of DQN as the basis for the development of all the DQN-based agents." and "We used the DDPG implementation of the rllab suite (Duan et al. 2016)". This indicates the use of existing open-source code, but the authors do not state that they provide their own source code for the methodology described in this paper.
Open Datasets | Yes | We then compare the performance of BDQ against a state-of-the-art continuous control algorithm, Deep Deterministic Policy Gradient (DDPG), on a set of standard continuous control manipulation and locomotion benchmark domains from the Open AI's MuJoCo Gym collection (Brockman et al. 2016; Duan et al. 2016).
Dataset Splits | No | The paper describes how evaluations were conducted periodically during training ("Evaluations were conducted every 50 episodes of training for 30 episodes with a greedy policy."), but it does not specify explicit train/validation/test *dataset splits* with percentages or counts; such splits are not typical for reinforcement learning environments. (A sketch of the quoted evaluation schedule appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor speeds, memory amounts, or other detailed machine specifications) for running its experiments.
Software Dependencies | No | The paper mentions using specific optimizers, initializations, and frameworks (e.g., "Adam optimizer", "ReLU", "Xavier initialization", "Open AI Baselines", "rllab suite") but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 10^-4, β1 = 0.9, and β2 = 0.999. We trained with a minibatch size of 64 and a discount factor γ = 0.99. The target network was updated every 10^3 time steps. We used the rectified non-linearity (or ReLU) (Glorot, Bordes, and Bengio 2011) for all hidden layers and linear activation on the output layers. The network had two hidden layers with 512 and 256 units in the shared network module and one hidden layer per branch with 128 units. The weights were initialized using the Xavier initialization (Glorot and Bengio 2010) and the biases were initialized to zero. A gradient clipping of size 10 was applied. We used the prioritized replay with a buffer size of 10^6 and hyperparameters α = 0.6, β0 = 0.4, η = 3 × 10^-7, and ϵ = 10^-8.
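
To make the quoted Experiment Setup concrete, the following is a minimal sketch of a branching Q-network wired to the reported layer sizes, initialization, and Adam settings. It is not the authors' code: the paper builds on the OpenAI Baselines DQN implementation, whereas PyTorch is assumed here, and obs_dim, num_branches, and bins_per_branch are hypothetical placeholders. Other quoted components (target network updates, prioritized replay) are not shown.

# Minimal sketch, not the authors' implementation: shared module with 512- and
# 256-unit hidden layers, one 128-unit hidden layer per action branch, linear
# outputs, Xavier-initialized weights, zero biases, and the quoted Adam settings.
# obs_dim, num_branches, and bins_per_branch are hypothetical placeholders.
import torch
import torch.nn as nn


class BranchingQNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_branches: int, bins_per_branch: int):
        super().__init__()
        # Shared decision module: two hidden layers (512 and 256 units, ReLU).
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        # One branch per action dimension: 128-unit hidden layer, linear output.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, bins_per_branch))
            for _ in range(num_branches)
        )
        # Xavier initialization for weights, zeros for biases, as reported.
        for module in self.modules():
            if isinstance(module, nn.Linear):
                nn.init.xavier_uniform_(module.weight)
                nn.init.zeros_(module.bias)

    def forward(self, obs: torch.Tensor) -> list:
        h = self.shared(obs)
        # One Q-value vector per action branch.
        return [branch(h) for branch in self.branches]


# Quoted optimizer settings: Adam with lr = 1e-4, beta1 = 0.9, beta2 = 0.999.
# The constructor arguments below are illustrative values only.
net = BranchingQNetwork(obs_dim=17, num_branches=6, bins_per_branch=33)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.9, 0.999))

# The paper reports "gradient clipping of size 10"; norm clipping is assumed here,
# applied after loss.backward() and before optimizer.step():
#   torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=10)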
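
The Dataset Splits row quotes the evaluation schedule used in place of fixed dataset splits. Below is a minimal sketch of that schedule under the same Python assumption; train_one_episode and greedy_episode_return are hypothetical helpers standing in for the agent's exploratory training episode and greedy rollout.

# Sketch of the quoted schedule: every 50 training episodes, run 30 evaluation
# episodes with a greedy policy and record the mean return.
# train_one_episode and greedy_episode_return are hypothetical helper functions.

EVAL_EVERY = 50
EVAL_EPISODES = 30


def training_with_periodic_evaluation(env, agent, total_episodes,
                                      train_one_episode, greedy_episode_return):
    mean_eval_returns = []
    for episode in range(1, total_episodes + 1):
        train_one_episode(env, agent)  # exploratory training episode
        if episode % EVAL_EVERY == 0:
            returns = [greedy_episode_return(env, agent)
                       for _ in range(EVAL_EPISODES)]
            mean_eval_returns.append(sum(returns) / len(returns))
    return mean_eval_returns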