Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Labeling without Seeing? Blind Annotation for Privacy-Preserving Entity Resolution
Authors: Yixiang Yao, Weizhao Jin, Srivatsan Ravi
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we conduct experiments to empirically evaluate the feasibility of using blind annotation to annotate datasets and the incurred overhead of homomorphic encryption. In general, domain oracles are asked to annotate datasets using blind annotation. The quality of the final annotation results is assessed by comparing them to the ground truth. ... We use precision, recall and F-measure as the standard evaluation metrics to measure the accuracy of blind annotation against the scores computed from ground truth labels. |
| Researcher Affiliation | Academia | Yixiang Yao EMAIL Department of Computer Science University of Southern California |
| Pseudocode | Yes | Algorithm 1: Commonly-used Functions ... Algorithm 2: Functions using BinFHE scheme |
| Open Source Code | No | The underlying HE program is implemented in OpenFHE (Al Badawi et al., 2022), an open-source project that efficiently and extensibly implements the post-quantum Fully Homomorphic Encryption schemes. |
| Open Datasets | Yes | We use the real-world entity resolution benchmark (Köpcke et al., 2010), which includes 4 tasks and lies in both e-commerce and bibliographic domains. ... One such synthetic dataset is Febrl (Freely Extensible Biomedical Record Linkage) (Christen, 2008), which is widely employed for generating census records containing fields such as name, sex, age, and address. |
| Dataset Splits | Yes | Specifically, we first randomly sample 50 labeled matches from each provided ground truth, and this covers at most 5% (50 records) of each dataset because one record could link to multiple records. Note that 50 records from each dataset are around 2.5-5% of the original dataset except for Scholar. |
| Hardware Specification | Yes | All the experiments are conducted on a Linux machine with an 8-core CPU @ 3.60 GHz and 32 GB RAM. |
| Software Dependencies | No | Specifically, the web GUI is implemented in Python, in which the DSL syntax is written in Extended Backus-Naur Form (EBNF) and parsed by the Lark library with a Look-Ahead Left-to-Right (LALR) parser. The underlying HE program is implemented in OpenFHE (Al Badawi et al., 2022)... The latter employs OpenMP, a multi-platform shared-memory parallel programming library... |
| Experiment Setup | Yes | Specifically, we set it to work in public-key encryption mode with the crypto context to be STD128, which guarantees more than 128 bits of security for classical computer attacks. ... When t rounds have finished, the protocol ends: the pairs whose value is true (consensus achieved) in F are added to the final ground truth, and others are discarded. Therefore, the ground truth set G is constructed as G = {(i, j, l) | (i, j, l) ∈ G_t, F(i, j) = true}. Note that increasing t tends to improve performance, but it also raises the labeling cost. An empirical analysis of the effect of t is presented in Section 5.2. |
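The ground-truth construction quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration only, not the paper's implementation: the names `build_ground_truth`, `labeled_pairs`, and `consensus` are hypothetical, with `labeled_pairs` standing in for G_t and `consensus` for the flag table F.

```python
def build_ground_truth(labeled_pairs, consensus):
    """Keep (i, j, l) triples whose pair reached consensus.

    labeled_pairs: iterable of (i, j, l) triples after round t (G_t)
    consensus:     dict mapping (i, j) -> bool (the flag table F)
    """
    # G = {(i, j, l) | (i, j, l) in G_t, F(i, j) = true}
    return [(i, j, l) for (i, j, l) in labeled_pairs
            if consensus.get((i, j), False)]

# Toy example: two pairs reach consensus, one is discarded.
G_t = [(1, 2, "match"), (3, 4, "non-match"), (5, 6, "match")]
F = {(1, 2): True, (3, 4): False, (5, 6): True}
G = build_ground_truth(G_t, F)
print(G)  # [(1, 2, 'match'), (5, 6, 'match')]
```

Pairs without consensus are dropped rather than relabeled, which matches the quoted protocol: raising t gives more pairs a chance to reach consensus at extra labeling cost.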