Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Determinantal Reinforcement Learning
Authors: Takayuki Osogami, Rudy Raymond | Pages: 4659-4666
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the proposed approach allows the agents to learn a nearly optimal policy approximately ten times faster than baseline approaches in benchmark tasks of multi-agent reinforcement learning. |
| Researcher Affiliation | Industry | Takayuki Osogami, Rudy Raymond IBM Research Tokyo Tokyo, Japan |
| Pseudocode | Yes | Algorithm 1 Determinantal SARSA |
| Open Source Code | No | The paper does not provide explicit statements about releasing source code, a repository link, or mention code availability in supplementary materials for the described methodology. |
| Open Datasets | Yes | We evaluate the performance of Determinantal SARSA on the blocker task (Sallans and Hinton 2001; 2004; Heess, Silver, and Teh 2013; Sallans 2002) and the stochastic policy task (Heess, Silver, and Teh 2013; Sallans 2002), which have been designed to evaluate the performance of multi-agent reinforcement learning methods. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | Yes | All of the experiments are carried out with Python implementation on a workstation having 48 GB memory and 4.0 GHz CPU. |
| Software Dependencies | No | The paper only mentions 'Python implementation' without specifying its version number or any other software dependencies with their specific versions. |
| Experiment Setup | Yes | Here, we let the learning rate decrease over time with a simple back-off strategy (Dabney and Barto 2012), where the learning rate at step $t$ is $\eta_t = \eta_0 \min\{1, 10^4/(t+1)\}$ with $\eta_0 = 10^{-3}$. The discount factor is set $\rho = 0.9$. In Boltzmann exploration, we let the inverse temperature $\beta_t$ increase over time $t$: $\beta_t = (\beta_{10^4})^{t/10^4}$ with $\beta_{10^4} = 10.0$. These hyper-parameters are set as the values that give best performance for the initial 10,000 steps of one run, where the candidate values are $\eta_0 \in \{10^{-2}, 10^{-3}, 10^{-4}\}$, $\rho \in \{0.9, 0.95, 1.0\}$, and $\beta_{10^4} \in \{1.0, 10.0, 100.0\}$. |
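
The schedules quoted in the Experiment Setup row can be sketched in a few lines of Python. This is only an illustration of the quoted formulas, not the authors' implementation (the paper releases no code). The function and constant names are hypothetical, and the inverse-temperature exponent is read as t/10^4, an assumption chosen so that the schedule reaches the quoted value 10.0 at step t = 10^4.

```python
import itertools

ETA_0 = 1e-3      # initial learning rate quoted in the table
BETA_1E4 = 10.0   # inverse temperature quoted in the table
HORIZON = 10**4   # the constant 10^4 that appears in both schedules


def learning_rate(t, eta_0=ETA_0):
    """Back-off schedule: eta_t = eta_0 * min(1, 10^4 / (t + 1))."""
    return eta_0 * min(1.0, HORIZON / (t + 1))


def inverse_temperature(t, beta_1e4=BETA_1E4):
    """Assumed reading of the quoted formula: beta_t = beta_1e4 ** (t / 10^4)."""
    return beta_1e4 ** (t / HORIZON)


# Candidate values listed in the table for the hyper-parameter search.
GRID = {
    "eta_0": [1e-2, 1e-3, 1e-4],
    "rho": [0.9, 0.95, 1.0],
    "beta_1e4": [1.0, 10.0, 100.0],
}


def candidate_settings(grid=GRID):
    """Enumerate every combination of the candidate hyper-parameter values."""
    keys = list(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, values)) for values in combos]


if __name__ == "__main__":
    for t in (0, 5_000, 10_000, 20_000):
        print(t, learning_rate(t), inverse_temperature(t))
    print(len(candidate_settings()), "candidate settings")
```

Under this reading, the learning rate stays at its initial value for the first 10,000 steps and then decays roughly as 1/t, while the inverse temperature grows geometrically, so exploration becomes steadily greedier. How the candidate settings were scored beyond "best performance for the initial 10,000 steps of one run" is not specified in the quoted text, so the sketch only enumerates them.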