Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
What Really is a Member? Discrediting Membership Inference via Poisoning
Authors: Neal Mangaokar, Ashish Hooda, Zhuohang Li, Bradley Malin, Kassem Fawaz, Somesh Jha, Atul Prakash, Amrita Roy Chowdhury
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Poison M against several MI tests across different datasets and model sizes, and find that it consistently flips test predictions and degrades performance well below random. Thus, our results highlight a disconnect between how MI tests operate and how their outputs are interpreted to determine membership in practice, calling for a re-evaluation of what it truly means for a point to be a member. Section 6 (Evaluation) covers the experimental setup, models and training, datasets, and metrics. Tables 1 to 8 present empirical results. Figures 2 and 3 show ROC curves and graphs. |
| Researcher Affiliation | Academia | University of Michigan, Ann Arbor; University of Wisconsin-Madison; Vanderbilt University |
| Pseudocode | Yes | Algorithm 1 Poison M Attack |
| Open Source Code | Yes | Code in supplementary material and datasets are public. |
| Open Datasets | Yes | We use Wikitext-103 as background and AI4Privacy/AGNews as canary datasets |
| Dataset Splits | Yes | injecting 500 canaries into 100K background points and holding out another 500 canaries for evaluation. |
| Hardware Specification | Yes | We run all experiments on a machine with 4 NVIDIA H100 GPUs, 40 Intel(R) Xeon(R) Silver 4410T CPUs, and 126GB of RAM. |
| Software Dependencies | No | The paper mentions 'AdamW' as an optimizer and 'Pythia models' but gives no version numbers for software libraries or environments, which a reproducible description of ancillary software would require. |
| Experiment Setup | Yes | All models are fine-tuned (for 1 epoch) on poisoned data using AdamW (lr = 2e-5, batch size = 16). |
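
The dataset split and fine-tuning settings quoted in the table can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' code: the function name `build_canary_split`, the placeholder strings standing in for Wikitext-103 and canary records, the seed, and the config key names are all hypothetical, and the poisoning step itself is not modeled.

```python
import random

def build_canary_split(background, canaries, n_inject=500, n_holdout=500, seed=0):
    """Inject n_inject canaries into the background set and hold out n_holdout more.

    Hypothetical helper: it mirrors only the counts quoted in the table
    (500 injected, 500 held out, 100K background points).
    """
    if len(canaries) < n_inject + n_holdout:
        raise ValueError("not enough canaries for the requested split")
    rng = random.Random(seed)
    shuffled = list(canaries)
    rng.shuffle(shuffled)
    injected = shuffled[:n_inject]                       # members: seen in training
    held_out = shuffled[n_inject:n_inject + n_holdout]   # non-members: evaluation only
    train = list(background) + injected
    rng.shuffle(train)
    return train, injected, held_out

# Fine-tuning settings as reported in the table; the key names are assumptions.
FINETUNE_CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 1,
}

# Placeholder records standing in for the real datasets.
background = [f"bg-{i}" for i in range(100_000)]
canaries = [f"canary-{i}" for i in range(1_000)]

train, injected, held_out = build_canary_split(background, canaries)
print(len(train), len(injected), len(held_out))  # 100500 500 500
```

Keeping the held-out canaries disjoint from the training set is what lets a membership inference test be scored on known members versus known non-members.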