Determinantal Reinforcement Learning

Authors: Takayuki Osogami, Rudy Raymond (pp. 4659-4666)

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning.
Researcher Affiliation | Industry | Takayuki Osogami, Rudy Raymond; IBM Research - Tokyo, Tokyo, Japan
Pseudocode | Yes | Algorithm 1: Determinantal SARSA (a hedged sketch of a log-determinant SARSA update is given after the table)
Open Source Code | No | The paper does not provide explicit statements about releasing source code, a repository link, or mention code availability in supplementary materials for the described methodology.
Open Datasets | Yes | We evaluate the performance of Determinantal SARSA on the blocker task (Sallans and Hinton 2001; 2004; Heess, Silver, and Teh 2013; Sallans 2002) and the stochastic policy task (Heess, Silver, and Teh 2013; Sallans 2002), which have been designed to evaluate the performance of multi-agent reinforcement learning methods.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | Yes | All of the experiments are carried out with Python implementation on a workstation having 48 GB memory and 4.0 GHz CPU.
Software Dependencies | No | The paper only mentions 'Python implementation' without specifying its version number or any other software dependencies with their specific versions.
Experiment Setup | Yes | Here, we let the learning rate decrease over time with a simple back-off strategy (Dabney and Barto 2012), where the learning rate at step t is η_t = η_0 min{1, 10^4 / (t + 1)} with η_0 = 10^-3. The discount factor is set to ρ = 0.9. In Boltzmann exploration, we let the inverse temperature β_t increase over time t: β_t = (β_{10^4})^{t/10^4} with β_{10^4} = 10.0. These hyper-parameters are set to the values that give the best performance for the initial 10,000 steps of one run, where the candidate values are η_0 ∈ {10^-2, 10^-3, 10^-4}, ρ ∈ {0.9, 0.95, 1.0}, and β_{10^4} ∈ {1.0, 10.0, 100.0}. (See the code sketch of these schedules after the table.)
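
The Pseudocode row above only names Algorithm 1 (Determinantal SARSA). As a rough illustration of what a SARSA update with a determinantal value function can look like, the following is a minimal NumPy sketch that assumes the action value is a log-determinant of a principal block, Q(s, a) = log det(V_X V_X^T), where X is the set of feature indices activated by the state-action pair. This parameterization, the variable names, and the shapes are assumptions made for this sketch, not the authors' exact formulation from the paper.

    import numpy as np

    def log_det_q(V, idx):
        # Assumed Q-value: log det of the principal block V_X V_X^T,
        # where V_X are the rows of V indexed by the active features idx.
        Vx = V[idx]
        sign, logdet = np.linalg.slogdet(Vx @ Vx.T)
        return logdet

    def determinantal_sarsa_step(V, idx, idx_next, reward, rho, eta):
        # TD error of a standard SARSA update with discount factor rho
        delta = reward + rho * log_det_q(V, idx_next) - log_det_q(V, idx)
        # Gradient of log det(V_X V_X^T) with respect to the active rows V_X:
        # 2 * (V_X V_X^T)^{-1} V_X
        Vx = V[idx]
        grad = 2.0 * np.linalg.solve(Vx @ Vx.T, Vx)
        V = V.copy()
        V[idx] += eta * delta * grad
        return V

Note that only the rows of V indexed by the active features are touched in each step; everything else in the sketch is ordinary SARSA.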
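
The learning-rate and inverse-temperature schedules quoted in the Experiment Setup row are simple enough to write down directly. The sketch below reproduces η_t = η_0 min{1, 10^4/(t + 1)} and β_t = (β_{10^4})^{t/10^4}, together with the Boltzmann action selection they feed into; the function names and the use of NumPy are illustrative choices, not taken from the paper's code.

    import numpy as np

    ETA0 = 1e-3      # eta_0, chosen from {1e-2, 1e-3, 1e-4}
    RHO = 0.9        # discount factor, chosen from {0.9, 0.95, 1.0}
    BETA_1E4 = 10.0  # inverse temperature at step t = 10^4, chosen from {1.0, 10.0, 100.0}

    def learning_rate(t):
        # Back-off schedule: eta_t = eta_0 * min(1, 10^4 / (t + 1))
        return ETA0 * min(1.0, 1e4 / (t + 1))

    def inverse_temperature(t):
        # Annealing schedule: beta_t = (beta_{10^4})^(t / 10^4),
        # so beta_0 = 1 and beta_{10^4} = 10 for the chosen setting.
        return BETA_1E4 ** (t / 1e4)

    def boltzmann_action(q_values, t, rng=None):
        # Sample an action with probability proportional to exp(beta_t * Q(s, a))
        rng = rng or np.random.default_rng()
        logits = inverse_temperature(t) * np.asarray(q_values, dtype=float)
        p = np.exp(logits - logits.max())
        return rng.choice(len(q_values), p=p / p.sum())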