Determinantal Reinforcement Learning
Authors: Takayuki Osogami, Rudy Raymond
AAAI 2019, pp. 4659-4666
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning. |
| Researcher Affiliation | Industry | Takayuki Osogami, Rudy Raymond, IBM Research - Tokyo, Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Determinantal SARSA (a hedged sketch of the log-determinant update appears below the table) |
| Open Source Code | No | The paper does not explicitly state that source code is released, provide a repository link, or mention code availability in supplementary materials for the described methodology. |
| Open Datasets | Yes | We evaluate the performance of Determinantal SARSA on the blocker task (Sallans and Hinton 2001; 2004; Heess, Silver, and Teh 2013; Sallans 2002) and the stochastic policy task (Heess, Silver, and Teh 2013; Sallans 2002), which have been designed to evaluate the performance of multi-agent reinforcement learning methods. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | All of the experiments are carried out with Python implementation on a workstation having 48 GB memory and 4.0 GHz CPU. |
| Software Dependencies | No | The paper only mentions 'Python implementation' without specifying its version number or any other software dependencies with their specific versions. |
| Experiment Setup | Yes | Here, we let the learning rate decrease over time with a simple back-off strategy (Dabney and Barto 2012), where the learning rate at step t is η_t = η_0 · min{1, 10^4/(t+1)} with η_0 = 10^-3. The discount factor is set to ρ = 0.9. In Boltzmann exploration, we let the inverse temperature β_t increase over time t: β_t = (β_{10^4})^{t/10^4} with β_{10^4} = 10.0. These hyper-parameters are set as the values that give the best performance for the initial 10,000 steps of one run, where the candidate values are η_0 ∈ {10^-2, 10^-3, 10^-4}, ρ ∈ {0.9, 0.95, 1.0}, and β_{10^4} ∈ {1.0, 10.0, 100.0}. (Both schedules are transcribed in the Python snippet below the table.) |
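The Pseudocode row points to the paper's Algorithm 1 (Determinantal SARSA). Below is a minimal sketch of the core update, assuming the Q-function is modeled as the log-determinant of the principal submatrix of L = V Vᵀ selected by the binary features active in a state-action pair; the parameter matrix `V`, the index sets `active`/`active_next`, and the function names are illustrative choices, not the paper's code.

```python
import numpy as np

def q_value(V, active):
    """Log-determinant Q-value: Q(s, a) = log det(V_X V_X^T), where X is the
    set of binary features active in (s, a). Assumes the submatrix is nonsingular."""
    Vx = V[active]                                  # rows of V for active features
    _, logdet = np.linalg.slogdet(Vx @ Vx.T)
    return logdet

def sarsa_step(V, active, active_next, reward, rho, eta):
    """One SARSA-style TD update on the log-det Q-function (illustrative)."""
    delta = reward + rho * q_value(V, active_next) - q_value(V, active)
    Vx = V[active]
    # Closed-form gradient: d log det(Vx Vx^T) / d Vx = 2 (Vx Vx^T)^{-1} Vx.
    grad = 2.0 * np.linalg.solve(Vx @ Vx.T, Vx)
    V[active] += eta * delta * grad                 # TD(0) gradient step
    return V
```

The closed-form gradient of the log-determinant is what keeps each update cheap; for the exact procedure, including how state-action pairs are mapped to feature sets, defer to Algorithm 1 in the paper.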
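To make the quoted hyper-parameter schedules concrete, here is a direct transcription in Python; the function names are ours, and the Boltzmann sampler is a standard softmax included only to show where β_t enters.

```python
import numpy as np

def learning_rate(t, eta0=1e-3):
    # eta_t = eta_0 * min{1, 10^4 / (t + 1)}: flat for roughly the first
    # 10^4 steps, then decaying as 1/t (the back-off of Dabney and Barto 2012).
    return eta0 * min(1.0, 1e4 / (t + 1))

def inverse_temperature(t, beta_1e4=10.0):
    # beta_t = (beta_{10^4})^(t / 10^4): grows from 1 at t = 0 to
    # beta_{10^4} = 10 at t = 10^4, so exploration becomes greedier over time.
    return beta_1e4 ** (t / 1e4)

def boltzmann_action(q_values, t, rng=None):
    # Standard Boltzmann (softmax) exploration at inverse temperature beta_t.
    if rng is None:
        rng = np.random.default_rng()
    logits = inverse_temperature(t) * np.asarray(q_values, dtype=float)
    probs = np.exp(logits - logits.max())           # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

For example, `learning_rate(0)` returns 10^-3 and `learning_rate(10**5 - 1)` returns 10^-4, matching the 1/t back-off after the initial flat phase.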