Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..

Nearly-Linear Time and Massively Parallel Algorithms for $k$-anonymity

Authors: Kevin Aydin, Honghao Lin, David P. Woodruff, Peilin Zhong

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirically, we also demonstrate that our algorithmic ideas can be adapted to existing heuristic methods, leading to significant speed-ups while preserving comparable performance. On the hardness side, we study the related single-point k-anonymity problem, where the goal is to select k 1 additional records to make a given record indistinguishable. Assuming the dense vs random conjecture in complexity theory, we show that for n = kc, no algorithm can achieve a k1 O(1/c) approximation in poly(n) time, providing evidence for the inherent hardness of the k-anonymity problem. 5 Experiments All of our experiments were conducted on a device with a 3.30GHz CPU and 16GB RAM. We will use the following dataset which has been widely used in the study of anonymized privacy protection: Adult.3 The Adult data contains 48842 tuples from US Census data. The tuples with missing values are removed. In particular, we choose 8 attributes as quasi-identifier. Results Summary. The experimental results are presented in Figure 1. We vary the value of k from 5 to 30 and report the number of hidden attributes.
Researcher Affiliation	Collaboration	Kevin Aydin Google Research EMAIL Honghao Lin Carnegie Mellon University EMAIL David P. Woodruff CMU & Google Research EMAIL Peilin Zhong Google Research EMAIL
Pseudocode	Yes	Algorithm 1 k-Anonymity via Near Neighbors Algorithm 2 Clustering with Pointwise Guarantee
Open Source Code	Yes	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Yes, we include the code in the supplementary materials.
Open Datasets	Yes	Adult.3 The Adult data contains 48842 tuples from US Census data. The tuples with missing values are removed. In particular, we choose 8 attributes as quasi-identifier. 3The Adult from the UCI Machine Learning Repository.
Dataset Splits	No	The Adult data contains 48842 tuples from US Census data. The tuples with missing values are removed. In particular, we choose 8 attributes as quasi-identifier. No specific information on training/test/validation splits (e.g., percentages, sample counts, or predefined split references) is provided.
Hardware Specification	Yes	All of our experiments were conducted on a device with a 3.30GHz CPU and 16GB RAM.
Software Dependencies	No	Ours (Heuristic) : Implemented in C++ for efficiency. [ZWL+18]: Since no official implementation was available, we implemented the algorithm ourselves in C++. We applied the same optimization strategies as in our own implementation. The paper mentions implementation in C++ but does not specify version numbers for C++ compilers, libraries, or other software dependencies.
Experiment Setup	No	Results Summary. The experimental results are presented in Figure 1. We vary the value of k from 5 to 30 and report the number of hidden attributes. 1. For the center selection step (Step 1), instead of computing the average distance for all points, we randomly sample a fixed number of candidate points. 2. Once a new center is chosen, we leverage Locality-Sensitive Hashing (LSH) to efficiently compute its approximate k-nearest neighbors. This process is similar to the procedures in Lemmas 3.5 and 3.6. 4 In our experiments, these modifications lead to a substantial reduction in runtime while preserving performance comparable to the original heuristic. While the paper describes varying 'k' and using a random sampling strategy, it does not provide specific numerical values for hyperparameters such as the 'fixed number of candidate points' sampled, or other configuration settings.