Simplified Action Decoder for Deep Multi-Agent Reinforcement Learning

Authors: Hengyuan Hu, Jakob N Foerster

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we present a new deep multi-agent RL method, the Simplified Action Decoder (SAD), which resolves this contradiction by exploiting the centralized training phase. During training, SAD lets other agents observe not only the (exploratory) action chosen but also the greedy action of their teammates. By combining this simple intuition with best practices for multi-agent learning, SAD establishes a new SOTA for learning methods on the 2-5 player self-play part of the Hanabi challenge. Our ablations show the contributions of SAD compared with the best-practice components. All of our code and trained agents are available at https://github.com/facebookresearch/Hanabi_SAD. (A minimal sketch of this greedy-action idea appears after the table.)
Researcher Affiliation | Industry | Hengyuan Hu, Jakob N Foerster, Facebook AI Research, CA, USA; {hengyuan,jnf}@fb.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All of our code and trained agents are available at https://github.com/facebookresearch/Hanabi_SAD. In order to ensure that our results can be easily verified and extended, we also evaluate our method on a proof-of-principle matrix game and open-source our training code and agents. The code is available here: www.bit.ly/2mBJLyk.
Open Datasets | Yes | To ensure reproducibility and comparability of our results we use the Hanabi Learning Environment (HLE) (Bard et al., 2019) for all experimentation. For further details regarding Hanabi and the self-play part of the Hanabi challenge please see Bard et al. (2019).
Dataset Splits | No | The paper does not explicitly describe training, validation, and test dataset splits (e.g., percentages, sample counts, or the methodology used to create them). It mentions 'evaluations of the best model from our various training runs' (Table 2), which implies some form of model selection, but not a distinct validation split.
Hardware Specification | Yes | In all Hanabi experiments, we run N = 80 actor threads with K = 80 environments in each thread on a single machine with 40 CPU cores and 2 GPUs. All asynchronous actors share one GPU and the trainer uses the other GPU for gradient computation and model updates.
Software Dependencies | No | The paper mentions various methods and architectures (e.g., double DQN, the dueling network architecture, prioritized replay, the Adam optimizer) by citing their original papers, but it does not specify version numbers for any software dependencies (e.g., Python, PyTorch, or TensorFlow).
Experiment Setup | Yes | Our Hanabi agent uses the dueling network architecture (Wang et al., 2015). The main body of the network consists of 1 fully connected layer of 512 units and 2 LSTM (Hochreiter & Schmidhuber, 1997) layers of 512 units, followed by two output heads for value and advantages respectively. The maximum length of an episode is capped at 80 steps... Each actor executes an ϵ_i-greedy policy where ϵ_i = ϵ^(1 + i/(N−1)·α) for i ∈ {0, ..., N−1}, but with a smaller ϵ = 0.1 and α = 7... The per-time-step priority δ_t is the TD error, and the per-episode priority is computed following δ_e = η·max_t δ_t + (1 − η)·δ̄, where η = 0.9. The priority exponent is set to 0.9 and the importance-sampling exponent to 0.6. We use n-step return (Sutton, 1988) and double Q-learning (van Hasselt et al., 2015) for target computation during training. The discount factor γ is set to 0.999. The network is updated using the Adam optimizer (Kingma & Ba, 2014) with learning rate lr = 6.25 × 10^−5 and ϵ = 1.5 × 10^−5. The trainer sends its network weights to all actors every 10 updates and the target network is synchronized with the online network every 2500 updates. These hyper-parameters are fixed across all experiments. The prioritized replay buffer contains 2^17 (131,072) episodes. We warm up the replay buffer with 10,000 episodes before training starts. The batch size during training is 128 for games of different numbers of players... The replay buffer size is reduced to 2^16 for 2-player and 3-player games and 2^15 for 4-player and 5-player games. The batch sizes for 2-, 3-, 4-, and 5-player games are 64, 43, 32, and 26 respectively, to account for the fact that each sample contains more data. (A code sketch of this setup follows the table.)
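The quoted setup pins down the network shape (FC 512 → 2×LSTM 512 → dueling heads), the per-actor exploration schedule, and the episode-priority rule. Below is a minimal PyTorch-style sketch of those pieces under stated assumptions: the class and function names are illustrative, the ReLU activation is assumed (the excerpt does not name one), and using the mean absolute TD error as the second term of the episode priority follows the common R2D2 convention rather than an explicit statement in the quote. This is a sketch, not the released implementation.

```python
import torch
import torch.nn as nn


class DuelingRecurrentQNet(nn.Module):
    """Sketch of the described body: 1 FC layer (512) -> 2 LSTM layers (512),
    followed by separate value and advantage heads (dueling)."""

    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 512):
        super().__init__()
        # ReLU after the FC layer is an assumption; the quoted text does not name the activation.
        self.fc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.value_head = nn.Linear(hidden, 1)
        self.adv_head = nn.Linear(hidden, num_actions)

    def forward(self, obs, hidden_state=None):
        # obs: (batch, time, obs_dim)
        x = self.fc(obs)
        x, hidden_state = self.lstm(x, hidden_state)
        value = self.value_head(x)                      # (batch, time, 1)
        adv = self.adv_head(x)                          # (batch, time, num_actions)
        # Standard dueling combination: Q = V + A - mean(A)
        q = value + adv - adv.mean(dim=-1, keepdim=True)
        return q, hidden_state


def actor_epsilon(i: int, num_actors: int = 80, eps: float = 0.1, alpha: float = 7.0) -> float:
    """Per-actor exploration rate: eps_i = eps ** (1 + i / (N - 1) * alpha)."""
    return eps ** (1.0 + i / (num_actors - 1) * alpha)


def episode_priority(td_errors: torch.Tensor, eta: float = 0.9) -> torch.Tensor:
    """Episode priority delta_e = eta * max_t |delta_t| + (1 - eta) * mean_t |delta_t|.
    Using the mean as the second term is an assumption (R2D2 convention)."""
    abs_td = td_errors.abs()
    return eta * abs_td.max() + (1.0 - eta) * abs_td.mean()
```

With the quoted values, actor_epsilon(0) gives the most exploratory rate (0.1), while actor_epsilon(79) is about 10^−8, i.e. essentially greedy.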
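The 'Research Type' row quotes the core SAD idea: during centralized training, teammates observe the greedy action an agent would have taken in addition to the (possibly exploratory) action it actually executed. The sketch below illustrates that idea only; sad_step and its interface are hypothetical and reuse the DuelingRecurrentQNet sketch above, not the released Hanabi_SAD code.

```python
import torch


def sad_step(q_net, obs, hidden, epsilon: float, num_actions: int):
    """Illustrative only: epsilon-greedy acting that also records the greedy action,
    so teammates can condition on it during centralized training."""
    q, hidden = q_net(obs, hidden)                 # q: (batch, time, num_actions)
    greedy_action = q.argmax(dim=-1)               # the action the agent "meant"
    if torch.rand(()) < epsilon:                   # explore with probability epsilon
        exec_action = torch.randint(num_actions, greedy_action.shape)
    else:
        exec_action = greedy_action
    # Both actions are stored in the trajectory; at training time, teammates'
    # inputs include the greedy action alongside the executed one.
    return exec_action, greedy_action, hidden
```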