Dataset Inference for Self-Supervised Models

Authors: Adam Dziedzic, Haonan Duan, Muhammad Ahmad Kaleem, Nikita Dhawan, Jonas Guan, Yannis Cattan, Franziska Boenisch, Nicolas Papernot

NeurIPS 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical results in the vision domain demonstrate that dataset inference is a promising direction for defending self-supervised models against model stealing. |
| Researcher Affiliation | Academia | University of Toronto and Vector Institute |
| Pseudocode | Yes | Algorithm 1 summarizes the stealing approach used by an adversary. |
| Open Source Code | No | The paper references 'an open-source PyTorch implementation of SimCLR' (https://github.com/kuangliu/pytorch-cifar), but this is a third-party tool used by the authors, not their own source code for the proposed defense. |
| Open Datasets | Yes | We evaluate our defense against encoder extraction attacks using five different vision datasets (CIFAR10, CIFAR100 [28], SVHN [34], STL10 [8], and ImageNet [11]). |
| Dataset Splits | Yes | For SVHN, we merge the original training and test splits, use a randomly selected 80% as the training set, and keep the remaining 20% as the test set. For SVHN and CIFAR10, we use 50% of the training set to train GMMs and the remainder for evaluation. |
| Hardware Specification | No | The paper does not specify the GPU or CPU models used for the experiments, or any other hardware details. |
| Software Dependencies | No | The paper mentions using a 'PyTorch implementation' but does not specify its version or any other software dependencies with version details. |
| Experiment Setup | Yes | We train GMMs with 10 components for SVHN and CIFAR10, and 50 components for ImageNet. In general, we observe that the larger the number of GMM components, the better the defense. For ImageNet, we restrict the covariance matrix to be diagonal for efficiency. For CIFAR10 and SVHN, we use the full covariance matrix. |
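
The split procedure quoted in the 'Dataset Splits' row can be made concrete with a short sketch. This is a minimal illustration under assumptions, not the authors' code: the use of torchvision and scikit-learn's `train_test_split`, the variable names, and the fixed random seed are all choices made here for clarity.

```python
# Minimal sketch of the SVHN split described in the paper: merge the
# original train/test splits, re-split 80/20 at random, then reserve 50%
# of the training set for GMM fitting. Library choices and names are
# assumptions, not the authors' implementation.
import numpy as np
from sklearn.model_selection import train_test_split
from torchvision import datasets

# Merge the original SVHN training and test splits.
svhn_train = datasets.SVHN(root="data", split="train", download=True)
svhn_test = datasets.SVHN(root="data", split="test", download=True)
images = np.concatenate([svhn_train.data, svhn_test.data])
labels = np.concatenate([svhn_train.labels, svhn_test.labels])

# Randomly re-split: 80% training, 20% test.
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, train_size=0.8, random_state=0  # seed is assumed
)

# Use 50% of the training set to fit GMMs; hold out the rest for evaluation.
x_gmm, x_eval, y_gmm, y_eval = train_test_split(
    x_train, y_train, train_size=0.5, random_state=0
)
```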
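
The GMM configuration quoted in the 'Experiment Setup' row can likewise be sketched. The snippet below assumes scikit-learn's `GaussianMixture` as a stand-in, since the paper does not name its GMM implementation; the `build_gmm` helper and the representation shapes are hypothetical.

```python
# Sketch of the reported GMM settings: 10 components with full covariance
# for CIFAR10/SVHN, 50 components with diagonal covariance for ImageNet
# (diagonal is restricted for efficiency, per the paper).
from sklearn.mixture import GaussianMixture

def build_gmm(dataset_name: str) -> GaussianMixture:
    # build_gmm is a hypothetical helper, not from the paper.
    if dataset_name == "imagenet":
        return GaussianMixture(n_components=50, covariance_type="diag")
    # CIFAR10 and SVHN use the full covariance matrix.
    return GaussianMixture(n_components=10, covariance_type="full")

# Usage: fit on encoder representations of the GMM split, e.g.
#   gmm = build_gmm("cifar10")
#   gmm.fit(train_representations)  # shape: (n_samples, embedding_dim)
```

Consistent with the paper's observation that more components help the defense, the component count is the main knob here; the covariance structure trades fidelity for efficiency on the larger ImageNet embedding space.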