Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A New and Flexible Approach to the Analysis of Paired Comparison Data

Authors: Ivo F. D. Oliveira, Nir Ailon, Ori Davidov

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the following we describe four experiments performed to further test and investigate Poly Rank. Each simulation is performed 1000 times and we report and discuss the average performance under the specified conditions. Finally, using a large data-set of computer chess matches, we estimate the comparison function and find that the model used by the International Chess Federation does not seem to apply to computer chess.
Researcher Affiliation | Academia | Ivo F. D. Oliveira (EMAIL), Department of Science, Engineering and Technology, UFVJM - Federal University of the Valleys of Jequitinhonha and Mucuri, Teofilo Otoni, Minas Gerais, Brazil; Nir Ailon (EMAIL), Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel; Ori Davidov (EMAIL), Department of Statistics, University of Haifa, Haifa, Israel
Pseudocode | Yes | Algorithm: Poly Rank
Open Source Code | No | The paper does not provide a direct link to a source-code repository, nor an explicit statement that the code for the described methodology is publicly released or included in supplementary materials. The CC-BY 4.0 license refers to the paper itself.
Open Datasets | Yes | Finally, using a large data-set of computer chess matches, we estimate the comparison function and find that the model used by the International Chess Federation does not seem to apply to computer chess. 1. Publicly available at http://kirill-kryukov.com/chess/kcec/games.html.
Dataset Splits | No | The paper describes generating simulation data (e.g., "generating I = 20 items", "50 pairs", "round-robin tournaments with an increasing number of items", "mij = 1 to 5") and uses a real-world dataset (the computer-chess data set). However, it does not state how this data was split into training, validation, and test sets in the conventional machine-learning sense; it only describes data-generation parameters and the total number of comparisons.
Hardware Specification | No | In our experience, problem (6) with any norm (weighted or unweighted) can be tackled successfully with a generic convex optimization solver on a desktop computer for problems of moderate size (e.g. with D ≤ 10 and I ≤ 120) in at most 2 or 3 seconds. The paper mentions a 'desktop computer' but provides no specific details such as CPU model, GPU, or memory.
Software Dependencies | No | In our experience, problem (6) with any norm (weighted or unweighted) can be tackled successfully with a generic convex optimization solver on a desktop computer for problems of moderate size (e.g. with D ≤ 10 and I ≤ 120) in at most 2 or 3 seconds. The paper mentions using a 'generic convex optimization solver' but does not specify its name or version.
Experiment Setup | Yes | Experiment 1: In this experiment we compare the empirical performance of the estimator of P when using Poly Rank with a low degree polynomial with its performance given the correct comparison function. Specifically, this is done by generating I = 20 items with merits µ_i sampled uniformly from [0, 10]. A total of 50 pairs, selected randomly, were compared assuming a Bradley-Terry-Luce (BTL) model. We refine the estimator p̂_ij = (Y_ij + 1)/(m_ij + 2) with Poly Rank using D = 5. Experiment 4: In practice the degree D of the polynomial (4) may not be known in advance. If we choose D to be too small then we may not fully capture the geometry of F, while if D is too large there is a danger of over-fitting and possible numerical problems. In this experiment we investigate the use of some well known model selection criteria (Claeskens and Hjort, 2006) for choosing D. In particular, we test the empirical performance of the Bayesian Information Criterion (BIC) and two variants of the Akaike Information Criterion (AIC) and contrast these with the performance of (leave-one-out) cross-validation.
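The Experiment 1 setup quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the excerpt does not specify the exact BTL comparison function or the number of comparisons per pair, so we assume the logistic form F(x) = 1 / (1 + exp(-x)) on merit differences and m_ij = 5 comparisons per sampled pair.

```python
import math
import random

random.seed(0)

# Generate I = 20 items with merits mu_i ~ Uniform[0, 10], as in Experiment 1.
I = 20
mu = [random.uniform(0.0, 10.0) for _ in range(I)]

# Select 50 distinct pairs at random.
pairs = random.sample([(i, j) for i in range(I) for j in range(i + 1, I)], 50)

m_ij = 5  # assumed number of comparisons per pair (not stated in the excerpt)
p_hat = {}
for (i, j) in pairs:
    # BTL win probability under the assumed logistic comparison function.
    p_true = 1.0 / (1.0 + math.exp(-(mu[i] - mu[j])))
    # Y_ij: number of times item i beats item j in m_ij comparisons.
    y = sum(random.random() < p_true for _ in range(m_ij))
    # Smoothed estimator p_hat_ij = (Y_ij + 1) / (m_ij + 2) from the quote.
    p_hat[(i, j)] = (y + 1) / (m_ij + 2)

print(len(p_hat))  # 50 estimated pairwise probabilities
```

Poly Rank would then refine these raw estimates by fitting a degree-D polynomial (D = 5 in the quoted setup) via the convex program referenced as problem (6); that fitting step is omitted here since the excerpt does not reproduce it.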