Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

What Really is a Member? Discrediting Membership Inference via Poisoning

Authors: Neal Mangaokar, Ashish Hooda, Zhuohang Li, Bradley Malin, Kassem Fawaz, Somesh Jha, Atul Prakash, Amrita Roy Chowdhury

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate Poison M against several MI tests across different datasets and model sizes, and find that it consistently flips test predictions and degrades performance well below random. Thus, our results highlight a disconnect between how MI tests operate and how their outputs are interpreted to determine membership in practice, calling for a re-evaluation of what it truly means for a point to be a member. Section 6: Evaluation, Experimental Setup, Models and Training, Datasets, Metrics. Tables 1 through 8 present empirical results. Figures 2 and 3 show ROC curves and graphs.
Researcher Affiliation	Academia	University of Michigan, Ann Arbor; University of Wisconsin-Madison; Vanderbilt University
Pseudocode	Yes	Algorithm 1 Poison M Attack
Open Source Code	Yes	Code in supplementary material and datasets are public.
Open Datasets	Yes	We use Wikitext-103 as background and AI4Privacy/AGNews as canary datasets
Dataset Splits	Yes	injecting 500 canaries into 100K background points and holding out another 500 canaries for evaluation.
Hardware Specification	Yes	We run all experiments on a machine with 4 NVIDIA H100 GPUs, 40 Intel(R) Xeon(R) Silver 4410T CPUs, and 126GB of RAM.
Software Dependencies	No	The paper names 'AdamW' as the optimizer and 'Pythia models' but gives no version numbers for software libraries or environments, which a reproducible description of ancillary software would require.
Experiment Setup	Yes	All models are fine-tuned (for 1 epoch) on poisoned data using AdamW (lr = 2e-5, batch size = 16).
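
The Dataset Splits and Experiment Setup rows can be read together as a simple data-preparation recipe. The following is a minimal sketch of that reading, not the authors' code: the function names, the placeholder strings, the random seed, and the pool of 1,000 canaries are all hypothetical, and the step count assumes the reported batch size of 16 is a global batch size, which the quoted text does not state. It does not implement the poisoning step of Algorithm 1.

```python
import math
import random


def build_split(background, canaries, n_inject=500, n_holdout=500, seed=0):
    """Split canaries into injected (members) and held-out (non-members).

    Hypothetical helper: returns the shuffled training set together with
    the two canary groups used for membership-inference evaluation.
    """
    if len(canaries) < n_inject + n_holdout:
        raise ValueError("not enough canaries for the requested split")
    rng = random.Random(seed)
    # Draw disjoint canary groups without replacement.
    chosen = rng.sample(canaries, n_inject + n_holdout)
    injected, held_out = chosen[:n_inject], chosen[n_inject:]
    # Only the injected canaries are mixed into the background data.
    training_set = list(background) + injected
    rng.shuffle(training_set)
    return training_set, injected, held_out


def steps_per_epoch(n_examples, batch_size=16):
    """Optimizer steps in one epoch, assuming a global batch size."""
    return math.ceil(n_examples / batch_size)


# Placeholder strings stand in for Wikitext-103 background points and
# AI4Privacy/AGNews canaries; real data loading is out of scope here.
background = [f"background-{i}" for i in range(100_000)]
canaries = [f"canary-{i}" for i in range(1_000)]

train, members, non_members = build_split(background, canaries)
n_steps = steps_per_epoch(len(train))
```

Under these assumptions the training set holds 100,500 points, so one epoch corresponds to 6,282 optimizer steps. If the batch size of 16 is per device on the 4-GPU machine, the step count would be smaller.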