Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Frustratingly Easy Truth Discovery

Authors: Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben Shabat, Gal Cohensius, Lirong Xia

AAAI 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We prove that this estimates well the actual competence level and enables separating high and low quality workers in a wide spectrum of domains and statistical models. Under Gaussian noise, this simple estimate is the unique solution to the Maximum Likelihood Estimator with a constant regularization factor. Finally, weighing workers according to their average proximity in a crowdsourcing setting, results in substantial improvement over unweighted aggregation and other truth discovery algorithms in practice.
Researcher Affiliation	Academia	Reshef Meir1, Ofra Amir1, Omer Ben-Porat1, Tsviel Ben-Shabat1, Gal Cohensius1, Lirong Xia2 1 Technion Israel Institute of Technology 2 Rensselaer Polytechnic Institute (RPI) EMAIL, EMAIL, EMAIL
Pseudocode	Yes	ALGORITHM 1: (P-TDD) FOR REAL-VALUED DATA
Open Source Code	No	Most proofs, as well as additional empirical results are available in the full version of the paper on arXiv: https://arxiv.org/abs/1905.00629. This is a link to the paper on arXiv, not the source code.
Open Datasets	Yes	Datasets: We used the following datasets from five different domains. We write the used distance measure in each domain in brackets. Categorical (Hamming distance): GG, DOGS, FLAGS (Shah and Zhou 2015); Predict (Mandal, Radanovic, and Parkes 2020)... Real-valued (NSED): BUILDINGS (collected for this paper); TRI (Hart et al. 2018); and EMO (Snow et al. 2008)... Language (GLEU): The TRANSL dataset contains English translations of Japanese sentences (Braylan and Lease 2020)... Outlines (Jaccard): The Etch-a-Cell dataset contains bitmaps of the outline of a tumor in 2D slices of a cell (Spiers et al. 2021).
Dataset Splits	No	The paper does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, or test sets. It mentions 'sampled n workers and m questions without repetition from each dataset (real or synthetic), and repeated the process at least 1000 times for every combination' which is a resampling strategy for robustness rather than a fixed split.
Hardware Specification	No	The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies	No	The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup	No	The paper describes the algorithms and their evaluation on datasets but does not specify concrete hyperparameter values, training configurations, or system-level settings for the experiments. It mentions 'sampling n workers and m questions' but no further details on the experimental setup itself.
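
Illustration of the resampling procedure quoted under Dataset Splits: the paper reports sampling n workers and m questions without repetition and repeating the process at least 1000 times. The sketch below shows one way such a loop could look. It is not the authors' code, which is not released. The data layout (a dict mapping worker to question to answer), the function names, the fixed seed, and the toy data are all assumptions made for this example.

```python
import random


def sample_subproblem(answers, n, m, rng):
    """Draw n workers and m questions without repetition.

    `answers` is a hypothetical layout: {worker: {question: answer}}.
    """
    workers = rng.sample(sorted(answers), n)
    all_questions = sorted({q for per_worker in answers.values() for q in per_worker})
    questions = rng.sample(all_questions, m)
    # Keep only the sampled workers and the sampled questions they answered.
    return {
        w: {q: answers[w][q] for q in questions if q in answers[w]}
        for w in workers
    }


def repeated_samples(answers, n, m, repetitions=1000, seed=0):
    """Repeat the draw `repetitions` times with a seeded generator (assumed)."""
    rng = random.Random(seed)
    return [sample_subproblem(answers, n, m, rng) for _ in range(repetitions)]


# Toy data (made up): 5 workers, 4 questions, every worker answers everything.
toy = {f"w{i}": {f"q{j}": (i + j) % 2 for j in range(4)} for i in range(5)}
runs = repeated_samples(toy, n=3, m=2, repetitions=10)
```

Each element of `runs` is an independent sub-instance on which an aggregation method could be evaluated. Recording the seed alongside n, m, and the repetition count is the kind of detail the Experiment Setup variable looks for.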