Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Entity Resolution in a Big Data Framework

Authors: Mayank Kejriwal

AAAI 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate all proposed algorithms both on established benchmarks, as well as new datasets procured in the hope of aiding future research efforts. We implement the prototype on 32 HDInsight3 nodes in the Microsoft Azure cloud infrastructure and evaluate it on real-world Big Data.
Researcher Affiliation	Academia	Mayank Kejriwal University of Texas at Austin 5.424C, Stop D9500 2317 Speedway, Austin, TX 78712 mayankkejriwal.azurewebsites.net EMAIL, 1-217-819-6696
Pseudocode	No	The provided text does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code	No	The paper mentions implementing a prototype in the Map Reduce framework but does not provide any statement or link indicating that the source code for their specific implementation is openly available.
Open Datasets	No	The paper states 'We evaluate all proposed algorithms both on established benchmarks, as well as new datasets procured in the hope of aiding future research efforts.' However, it does not provide concrete access information (e.g., specific links, DOIs, or formal citations with author/year) for these datasets, making their public availability unconfirmable for replication.
Dataset Splits	No	The paper does not provide specific percentages, sample counts, or methods for training/validation/test dataset splits. It only generally mentions evaluating on 'established benchmarks' and 'new datasets'.
Hardware Specification	No	The paper states, 'We implement the prototype on 32 HDInsight3 nodes in the Microsoft Azure cloud infrastructure.' While it mentions 'nodes' and a cloud service, it does not provide specific hardware details such as GPU/CPU models, processor types, or memory specifications.
Software Dependencies	No	The paper mentions the 'Map Reduce framework' and 'Apache Hadoop as a service' but does not specify version numbers for these software components or any other libraries, which is necessary for reproducible dependency information.
Experiment Setup	No	The paper describes the implementation of a prototype and its evaluation environment ('32 HDInsight3 nodes'), but it does not provide specific experimental setup details such as hyperparameter values, training configurations, or system-level settings required to replicate the experiments.