Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Counterfactual Multi-Agent Policy Gradients

Authors: Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, Shimon Whiteson

AAAI 2018 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate COMA in the testbed of StarCraft unit micromanagement... COMA significantly improves average performance over other multi-agent actor-critic methods in this setting...
Researcher Affiliation	Academia	Jakob N. Foerster, University of Oxford, United Kingdom; Gregory Farquhar, University of Oxford, United Kingdom; Triantafyllos Afouras, University of Oxford, UK; Nantas Nardelli, University of Oxford, UK; Shimon Whiteson, University of Oxford, UK
Pseudocode	Yes	Pseudocode and further details on the training procedure are in the supplementary material.
Open Source Code	No	The paper mentions 'Pseudocode and further details on the training procedure are in the supplementary material,' but does not explicitly state that the source code for their methodology is openly available or provide a link to a repository.
Open Datasets	No	The paper uses StarCraft unit micromanagement as its testbed and mentions TorchCraft for implementation. StarCraft is a commercial game environment, not a publicly available dataset, and the paper does not provide concrete access information for any generated or used dataset.
Dataset Splits	No	The paper does not explicitly provide specific percentages or sample counts for training, validation, and test dataset splits, nor does it cite predefined splits.
Hardware Specification	No	The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies	No	The paper states, 'Our implementation uses TorchCraft (Synnaeve et al. 2016) and Torch 7 (Collobert, Kavukcuoglu, and Farabet 2011),' but it does not specify explicit version numbers for these software dependencies or other libraries.
Experiment Setup	Yes	The actor consists of 128-bit gated recurrent units (GRUs)... We anneal ϵ linearly from 0.5 to 0.02 across 750 training episodes... We found that the most sensitive parameter was TD(λ), but settled on λ = 0.8...
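
The linear ϵ annealing quoted in the Experiment Setup row can be illustrated with a minimal sketch. Only the endpoints (0.5 and 0.02) and the 750-episode horizon come from the quoted text. The function name `linear_epsilon`, zero-based episode indexing, and holding ϵ at its final value after annealing are assumptions made for illustration, not details confirmed by the paper.

```python
def linear_epsilon(episode: int,
                   start: float = 0.5,
                   end: float = 0.02,
                   anneal_episodes: int = 750) -> float:
    """Return the exploration rate for a given training episode.

    Hypothetical helper: interpolates linearly from `start` to `end`
    over `anneal_episodes`, then holds at `end` (an assumption; the
    quoted text does not say what happens after annealing finishes).
    """
    if anneal_episodes <= 0:
        raise ValueError("anneal_episodes must be positive")
    # Clamp the episode index so the schedule stays within [start, end].
    clamped = min(max(episode, 0), anneal_episodes)
    fraction = clamped / anneal_episodes
    return start + fraction * (end - start)


if __name__ == "__main__":
    # Print the schedule at a few episodes, including one past the horizon.
    for ep in (0, 375, 750, 1000):
        print(ep, round(linear_epsilon(ep), 4))
```

A schedule like this makes the exploration settings easy to record alongside other hyperparameters, such as the reported λ = 0.8, when attempting a reproduction.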