Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

What Really is a Member? Discrediting Membership Inference via Poisoning

Authors: Neal Mangaokar, Ashish Hooda, Zhuohang Li, Bradley Malin, Kassem Fawaz, Somesh Jha, Atul Prakash, Amrita Roy Chowdhury

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate PoisonM against several MI tests across different datasets and model sizes, and find that it consistently flips test predictions and degrades performance well below random. Thus, our results highlight a disconnect between how MI tests operate and how their outputs are interpreted to determine membership in practice, calling for a re-evaluation of what it truly means for a point to be a member. Section 6: Evaluation, Experimental Setup, Models and Training, Datasets, Metrics. Tables 1, 2, 3, 4, 5, 6, 7, 8 present empirical results. Figures 2 and 3 show ROC curves and graphs.
Researcher Affiliation | Academia | University of Michigan, Ann Arbor; University of Wisconsin-Madison; Vanderbilt University
Pseudocode | Yes | Algorithm 1 PoisonM Attack
Open Source Code | Yes | Code in supplementary material and datasets are public.
Open Datasets | Yes | We use Wikitext-103 as background and AI4Privacy/AGNews as canary datasets
Dataset Splits | Yes | injecting 500 canaries into 100K background points and holding out another 500 canaries for evaluation.
Hardware Specification | Yes | We run all experiments on a machine with 4 NVIDIA H100 GPUs, 40 Intel(R) Xeon(R) Silver 4410T CPUs, and 126GB of RAM.
Software Dependencies | No | The paper mentions 'AdamW' as an optimizer and 'Pythia models' but does not provide specific version numbers for software libraries or environments beyond this, which would be necessary for a reproducible description of ancillary software.
Experiment Setup | Yes | All models are fine-tuned (for 1 epoch) on poisoned data using AdamW (lr = 2e-5, batch size = 16).
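
The split sizes and hyperparameters reported above can be written down as a minimal sketch. Everything beyond the reported numbers is an assumption made for illustration: the names FineTuneConfig and build_split are hypothetical, the placeholder strings stand in for the real Wikitext-103 and AI4Privacy/AGNews text, the random disjoint sampling is a guess at a reasonable procedure, and the poisoning step itself (Algorithm 1) is not reproduced here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the Experiment Setup row."""
    epochs: int = 1
    learning_rate: float = 2e-5
    batch_size: int = 16
    optimizer: str = "AdamW"


def build_split(background, canaries, n_inject=500, n_holdout=500, seed=0):
    """Return (training_set, member_canaries, held_out_canaries).

    Assumption: injected and held-out canaries are disjoint random samples.
    The paper's actual sampling procedure is not described in this summary.
    """
    if len(canaries) < n_inject + n_holdout:
        raise ValueError("not enough canaries for the requested split")
    rng = random.Random(seed)
    chosen = rng.sample(canaries, n_inject + n_holdout)
    members, held_out = chosen[:n_inject], chosen[n_inject:]
    # Injected canaries are mixed into the background before fine-tuning.
    training_set = list(background) + members
    rng.shuffle(training_set)
    return training_set, members, held_out


# Placeholder strings; the real experiments use dataset text, not these labels.
background = [f"background-{i}" for i in range(100_000)]
canaries = [f"canary-{i}" for i in range(1_000)]
training_set, members, held_out = build_split(background, canaries)
config = FineTuneConfig()
```

The held-out canaries never enter the training set, so under the conventional reading they serve as the non-member reference when a membership inference test is scored.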