Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Nonparametric Scoring Rules

Authors: Erik Zawadzki, Sebastien Lahaie

AAAI 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results are provided that confirm rapid convergence and that the expected score correlates well with standard notions of divergence, both important considerations for ensuring that agents are incentivized to report accurate information. We conducted experiments to investigate the empirical properties of the sample-based kernel score.
Researcher Affiliation	Collaboration	Erik Zawadzki Carnegie Mellon University Pittsburgh, PA 15213 EMAIL Sébastien Lahaie Microsoft Research New York, NY 10011 EMAIL
Pseudocode	No	The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code	No	The paper does not provide any concrete access to source code for the methodology described.
Open Datasets	No	The paper describes generating synthetic data for its experiments: "Our experiments require a way to generate a distribution P for the ground truth of an event and a distribution Q for the agent's beliefs. We generate P and Q independently from the same distribution of distributions. We used a mixture of between five and ten isotropic Laplace densities with bandwidth 0.05. Centers for the individual Laplace densities were located in [-1, 1]^D uniformly at random, and had weights drawn uniformly from [0, 1]." This is not a publicly available dataset with concrete access information.
Dataset Splits	No	The paper describes how samples and mixtures were generated and used for evaluation (e.g., "For each P_i, we ran M = 250 test instances. Each instance j ≤ M consisted of generating two m = 10,000 sample sets X_{i,j}, X'_{i,j} ~ P_i^m. For each instance the agent reported a prefix of k elements of X_{i,j} and was evaluated against the set X'_{i,j}."), but it does not specify traditional train/validation/test splits on a fixed dataset. Data is synthetically generated for each instance.
Hardware Specification	Yes	All experiments were coded in MATLAB, and run on a 3.40GHz i5-3570K with 8GB RAM.
Software Dependencies	No	The paper states, "All experiments were coded in MATLAB," but does not specify a version number for MATLAB or any other software libraries or dependencies with their versions.
Experiment Setup	Yes	Kernel bandwidth was 0.25, binning used 43 bins. The settings used in this experiment were found through an initial set of calibration experiments. In 1D roughly 45 bins seemed reasonable, whereas 62 bins was the best in 2D, and 43 bins was the best in three dimensions. We used 0.25 as our bandwidth in all three dimensions. The setting of m1 = 1500 was intended to represent a moderately sized report.
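
The synthetic data generation quoted in the Open Datasets and Dataset Splits rows can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code. Several details are assumptions: the function names are hypothetical, "five to ten" is read as inclusive, the weights are normalized to sum to one, the quoted bandwidth 0.05 is used as the Laplace scale, and each "isotropic Laplace" component is approximated by independent Laplace noise per coordinate.

```python
import numpy as np


def random_laplace_mixture(dim, rng, scale=0.05):
    """Draw the parameters of one random mixture (hypothetical helper)."""
    # Assumption: "between five and ten" components is inclusive on both ends.
    n_components = rng.integers(5, 11)
    # Centers are uniform in [-1, 1]^dim, as in the quoted description.
    centers = rng.uniform(-1.0, 1.0, size=(n_components, dim))
    # Weights are uniform in [0, 1]; normalizing them is an assumption.
    weights = rng.uniform(0.0, 1.0, size=n_components)
    weights = weights / weights.sum()
    return centers, weights, scale


def sample_mixture(mixture, n_samples, rng):
    """Sample points: pick a component by weight, then add Laplace noise."""
    centers, weights, scale = mixture
    component = rng.choice(len(weights), size=n_samples, p=weights)
    # Assumption: independent per-coordinate Laplace noise stands in for
    # the paper's isotropic Laplace density.
    noise = rng.laplace(loc=0.0, scale=scale, size=(n_samples, centers.shape[1]))
    return centers[component] + noise


rng = np.random.default_rng(0)
P = random_laplace_mixture(dim=2, rng=rng)  # ground-truth distribution
Q = random_laplace_mixture(dim=2, rng=rng)  # agent's belief, drawn independently

# Two sample sets of size m = 10,000 from the same ground truth P.
X = sample_mixture(P, 10_000, rng)
X_prime = sample_mixture(P, 10_000, rng)

# The agent reports a prefix of X; 1500 matches the quoted m1 setting.
report = X[:1500]
```

In this sketch `report` would be scored against `X_prime`; the kernel score itself is omitted because the table does not state its formula.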