Proper Proxy Scoring Rules

Authors: Jens Witkowski, Pavel Atanasov, Lyle Ungar, Andreas Krause

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we evaluate the quadratic proxy scoring rule with an example proxy experimentally using a real-world forecasting data set."
Researcher Affiliation | Collaboration | Jens Witkowski (ETH Zurich, jensw@inf.ethz.ch), Pavel Atanasov (Pytho LLC, pavel@pytho.io), Lyle H. Ungar (University of Pennsylvania, ungar@cis.upenn.edu), Andreas Krause (ETH Zurich, krausea@ethz.ch)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | "The data set we use is from the Good Judgment Project, a research and development project that provides probabilistic forecasts of geopolitical events to the United States intelligence community. We use data from the third year... For more details, including how the incentive issues that are inherent in self-selection were addressed, we refer to the work by Atanasov et al. (2016)."
Dataset Splits | Yes | "randomly sample two sets of questions, which we refer to as the selection set and the validation set. The validation set has size 30 and is scored using the quadratic proper scoring rule from Definition 4 with access to the corresponding 30 event outcomes."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not list software dependencies or version numbers that would be needed to replicate the experiment.
Experiment Setup | Yes | "The optimal α for the original dynamic Good Judgment data set, which was optimized out of sample from earlier seasons, is α = 2 (Atanasov et al. 2016), and so we are also using α = 2 in our experiments. ... We subset to forecaster pairs with at least 60 questions in common (of which there are 210) and, for each forecaster pair, randomly sample two sets of questions, which we refer to as the selection set and the validation set. The validation set has size 30... The selection set's size goes from 1 to 30..."
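The split protocol quoted above (disjoint selection and validation sets drawn from a forecaster pair's common questions, with the validation set scored by a quadratic proper scoring rule) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the quadratic rule is assumed to take the standard Brier-style form 1 − (p − o)² for a binary outcome, and the question IDs and set sizes are hypothetical stand-ins for the Good Judgment Project data.

```python
import random

def quadratic_score(p, outcome):
    # Quadratic (Brier-style) proper scoring rule for a binary event.
    # Assumed form of the paper's Definition 4; higher is better.
    return 1.0 - (p - outcome) ** 2

def split_questions(question_ids, selection_size, validation_size=30, seed=0):
    # Randomly partition a forecaster pair's common questions into a
    # selection set and a disjoint validation set, mirroring the setup:
    # validation size fixed at 30, selection size varying from 1 to 30.
    rng = random.Random(seed)
    sampled = rng.sample(question_ids, selection_size + validation_size)
    return sampled[:selection_size], sampled[selection_size:]

# Hypothetical pool: a pair with 60 questions in common (the paper's minimum).
questions = list(range(60))
selection, validation = split_questions(questions, selection_size=5)
```

Because the two sets are drawn in a single `sample` call without replacement, they are disjoint by construction, which matches the requirement that validation questions (and their outcomes) are held out from the selection step.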