Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
On the Importance of Difficulty Calibration in Membership Inference Attacks
Authors: Lauren Watson, Chuan Guo, Graham Cormode, Alexandre Sablayrolles
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effect of difficulty calibration, we perform a comprehensive evaluation of several score-based attacks on standard benchmark datasets. |
| Researcher Affiliation | Collaboration | Lauren Watson University of Edinburgh Chuan Guo Graham Cormode Meta AI Alexandre Sablayrolles. Work done during an internship at Facebook. Email: EMAIL, EMAIL |
| Pseudocode | No | The paper describes its methods and algorithms in paragraph text and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | An implementation of these attacks is available at https://github.com/facebookresearch/calibration_membership. |
| Open Datasets | Yes | We perform experiments on several benchmark classification datasets: German Credit, Hepatitis and Adult datasets from the UCI Machine Learning Repository (Dua & Graff, 2017), MNIST (LeCun et al., 1998), CIFAR10/100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | We split the data into two sets: a private set, known only to the trainer, and a public set, which is used for training reference models and selecting the decision threshold τ. The trainer trains their model h on half of the private set, keeping the other half as non-members. ... To find a threshold for optimal accuracy, we first split the public set of examples in half again, and treat one half as members, with the rest as non-members. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions the use of the 'Opacus' library for differentially private training, but it does not specify a version number for this or any other software dependency. |
| Experiment Setup | Yes | The target models are trained for between 50 and 200 epochs, with batch sizes varying from 4 (for very small datasets) to 1024. For optimization, we use SGD with a learning rate of 0.1, Nesterov momentum of 0.9 and a cosine learning rate schedule for the CIFAR10/100 and ImageNet datasets. Smaller datasets such as the German Credit dataset also used weight decay of 1×10⁻⁴. |
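
The split procedure quoted in the Dataset Splits row can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `private_fraction` parameter (the quoted text does not state the private/public ratio), and the convention that a higher score means "more likely a member" are all assumptions made for the example.

```python
import random


def split_fraction(items, fraction, rng):
    """Shuffle items and cut them into two disjoint lists at the given fraction."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def make_splits(n_examples, private_fraction=0.5, seed=0):
    """Hypothetical helper: build the four index sets described in the table.

    private_fraction is an assumption; the source does not give the ratio.
    """
    rng = random.Random(seed)
    private, public = split_fraction(range(n_examples), private_fraction, rng)
    # The trainer fits the target model on half of the private set;
    # the other half is held out as non-members.
    target_members, target_non_members = split_fraction(private, 0.5, rng)
    # The public set is halved again to pick the decision threshold tau.
    public_members, public_non_members = split_fraction(public, 0.5, rng)
    return {
        "target_members": target_members,
        "target_non_members": target_non_members,
        "public_members": public_members,
        "public_non_members": public_non_members,
    }


def best_accuracy_threshold(member_scores, non_member_scores):
    """Pick tau maximizing accuracy of the rule: score >= tau means member.

    Assumes higher scores indicate membership; flip signs if the attack
    score is oriented the other way.
    """
    candidates = sorted(set(member_scores) | set(non_member_scores))
    total = len(member_scores) + len(non_member_scores)
    best_tau, best_acc = None, -1.0
    for tau in candidates:
        correct = sum(s >= tau for s in member_scores)
        correct += sum(s < tau for s in non_member_scores)
        acc = correct / total
        if acc > best_acc:
            best_tau, best_acc = tau, acc
    return best_tau, best_acc


splits = make_splits(1000)
tau, acc = best_accuracy_threshold([0.9, 0.8, 0.7], [0.1, 0.2, 0.75])
```

In this sketch the threshold is chosen only from scores on the public halves, so the private set is never consulted when tuning the attack, which matches the separation the quoted passage describes.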