Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A New and Flexible Approach to the Analysis of Paired Comparison Data

Authors: Ivo F. D. Oliveira, Nir Ailon, Ori Davidov

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the following we describe four experiments performed to further test and investigate Poly Rank. Each simulation is performed 1000 times and we report and discuss the average performance under the specified conditions. Finally, using a large data-set of computer chess matches, we estimate the comparison function and find that the model used by the International Chess Federation does not seem to apply to computer chess.
Researcher Affiliation | Academia | Ivo F. D. Oliveira (EMAIL), Department of Science, Engineering and Technology, UFVJM - Federal University of the Valleys of Jequitinhonha and Mucuri, Teofilo Otoni, Minas Gerais, Brazil; Nir Ailon (EMAIL), Department of Computer Science, Technion - Israel Institute of Technology, Haifa, Israel; Ori Davidov (EMAIL), Department of Statistics, University of Haifa, Haifa, Israel
Pseudocode | Yes | Algorithm: Poly Rank
Open Source Code | No | The paper does not provide a direct link to a source-code repository, nor an explicit statement that the code for the described methodology is publicly released or included in supplementary materials. The CC-BY 4.0 license refers to the paper itself.
Open Datasets | Yes | Finally, using a large data-set of computer chess matches, we estimate the comparison function and find that the model used by the International Chess Federation does not seem to apply to computer chess. 1. Publicly available at http://kirill-kryukov.com/chess/kcec/games.html.
Dataset Splits | No | The paper describes generating simulation data (e.g., "generating I = 20 items", "50 pairs", "round-robin tournaments with an increasing number of items", "mij = 1 to 5") and uses a real-world dataset (the computer-chess data set). However, it does not state how this data was split into training, validation, and test sets in the conventional machine-learning sense; it only describes data-generation parameters and the total number of comparisons.
Hardware Specification | No | In our experience, problem (6) with any norm (weighted or unweighted) can be tackled successfully with a generic convex optimization solver on a desktop computer for problems of moderate size (e.g. with D ≤ 10 and I ≤ 120) in at most 2 or 3 seconds. The paper mentions a 'desktop computer' but provides no specific details such as CPU model, GPU, or memory.
Software Dependencies | No | In our experience, problem (6) with any norm (weighted or unweighted) can be tackled successfully with a generic convex optimization solver on a desktop computer for problems of moderate size (e.g. with D ≤ 10 and I ≤ 120) in at most 2 or 3 seconds. The paper mentions using a 'generic convex optimization solver' but does not specify its name or version.
Experiment Setup | Yes | Experiment 1: In this experiment we compare the empirical performance of the estimator of P when using Poly Rank with a low degree polynomial with its performance given the correct comparison function. Specifically, this is done by generating I = 20 items with merits µ_i sampled uniformly from [0, 10]. A total of 50 pairs, selected randomly, were compared assuming a Bradley-Terry-Luce (BTL) model. We refine the estimator p̂_ij = (Y_ij + 1)/(m_ij + 2) with Poly Rank using D = 5. Experiment 4: In practice the degree D of the polynomial (4) may not be known in advance. If we choose D to be too small then we may not fully capture the geometry of F, while if D is too large there is a danger of over-fitting and possible numerical problems. In this experiment we investigate the use of some well known model selection criteria (Claeskens and Hjort, 2006) for choosing D. In particular, we test the empirical performance of the Bayesian Information Criterion (BIC) and two variants of the Akaike Information Criterion (AIC) and contrast these with the performance of (leave-one-out) cross-validation.
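The Experiment 1 setup quoted above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the excerpt does not specify the exact BTL comparison function or the number of comparisons per pair, so we assume the logistic form F(x) = 1 / (1 + exp(-x)) on merit differences and m_ij = 5 comparisons per sampled pair.

```python
import math
import random

random.seed(0)

# Generate I = 20 items with merits mu_i ~ Uniform[0, 10], as in Experiment 1.
I = 20
mu = [random.uniform(0.0, 10.0) for _ in range(I)]

# Select 50 distinct pairs at random.
pairs = random.sample([(i, j) for i in range(I) for j in range(i + 1, I)], 50)

m_ij = 5  # assumed number of comparisons per pair (not stated in the excerpt)
p_hat = {}
for (i, j) in pairs:
    # BTL win probability under the assumed logistic comparison function.
    p_true = 1.0 / (1.0 + math.exp(-(mu[i] - mu[j])))
    # Y_ij: number of times item i beats item j in m_ij comparisons.
    y = sum(random.random() < p_true for _ in range(m_ij))
    # Smoothed estimator p_hat_ij = (Y_ij + 1) / (m_ij + 2) from the quote.
    p_hat[(i, j)] = (y + 1) / (m_ij + 2)

print(len(p_hat))  # 50 estimated pairwise probabilities
```

Poly Rank would then refine these raw estimates by fitting a degree-D polynomial (D = 5 in the quoted setup) via the convex program referenced as problem (6); that fitting step is omitted here since the excerpt does not reproduce it.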