Efficiently Computing Similarities to Private Datasets

Authors: Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our algorithms empirically exhibit improved query times and accuracy over the prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD-based approaches for comparable accuracy. (A hedged sketch of such a classifier appears below the table.)
Researcher Affiliation | Collaboration | Arturs Backurs (Microsoft Research, arturs.backurs@gmail.com); Zinan Lin (Microsoft Research, zinanlin@microsoft.com); Sepideh Mahabadi (Microsoft Research, mahabadi@ttic.edu); Sandeep Silwal (MIT, silwal@mit.edu); Jakub Tarnawski (Microsoft Research, jakub.tarnawski@microsoft.com)
Pseudocode | Yes | Algorithm 1: Pre-processing data structure; Algorithm 2: Interval query; Algorithm 3: One-dimensional distance query; Algorithm 4: High-dimensional ℓ1 distance query
Open Source Code | No | The paper neither provides a link to its own open-source code nor explicitly states that the code is publicly available.
Open Datasets | Yes | Our dataset consists of embeddings of CIFAR-10 in dimension 2048, computed from an Inception-v3 model (Szegedy et al., 2016) pre-trained on ImageNet (Deng et al., 2009). (One plausible way to extract such embeddings is sketched below the table.)
Dataset Splits | No | The paper mentions "train and test splits" but does not explicitly detail a separate validation split, its size, or the splitting methodology for its own experiments. It mentions hyper-parameter tuning for the baselines but no explicit validation split of its own.
Hardware Specification | Yes | All experiments, unless stated otherwise, are implemented in Python 3.9 on an M1 MacBook Pro with 32 GB of RAM.
Software Dependencies | No | The paper mentions only "Python 3.9"; it does not list any specific software libraries or solvers with version numbers.
Experiment Setup | Yes | Our hyper-parameters: for our method, we take the embeddings from the pre-trained r152_3x_sk1 SimCLR network released by Chen et al. (2020b). Our embeddings were in dimension 6144. Since we are computing the squared ℓ2 distance, we can apply a Johnson-Lindenstrauss projection to reduce the dimensionality without any privacy loss. Furthermore, we can also clip the embeddings, which reduces the overall sensitivity of our algorithm (reducing the R dependency in Section D.1). Thus, our hyper-parameters were the projection dimension, which we swept from 100 to 2000, and the clipping threshold, which we picked from 10 choices in [0.001, 1]. (A sketch of this projection-and-clipping step appears below the table.)
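
To make the DP classification result concrete, here is a minimal Python sketch of classifying by noisy average similarity. The function name `dp_average_similarity_classify` and the crude Laplace-mechanism accounting (even budget split across classes, substitution sensitivity of a clipped mean) are illustrative assumptions, not the paper's actual mechanism or privacy analysis.

```python
import numpy as np

def dp_average_similarity_classify(query, class_data, epsilon, clip=1.0, seed=None):
    """Pick the class whose private dataset has the highest noisy average
    inner-product similarity to `query`.

    `class_data` is a list of (n_c, d) arrays of private embeddings, one per
    class. Clipping bounds each point's contribution, so changing one point
    moves a class mean by at most 2*clip/n_c; Laplace noise calibrated to that
    sensitivity, with the budget split evenly across classes, gives a simple
    epsilon-DP release of the per-class scores (crude accounting, for
    illustration only).
    """
    rng = np.random.default_rng(seed)
    eps_per_class = epsilon / len(class_data)
    scores = []
    for X in class_data:
        sims = np.clip(X @ query, -clip, clip)   # bounded per-point similarity
        sensitivity = 2.0 * clip / X.shape[0]    # effect of changing one point
        scores.append(sims.mean() + rng.laplace(scale=sensitivity / eps_per_class))
    return int(np.argmax(scores))
```

A query is answered in a single pass over each class's embeddings, with no training loop at all; that structural difference from DP-SGD, which must train a model under noisy gradients, is where the reported speed advantage comes from.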
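The paper does not spell out its embedding-extraction pipeline for the Open Datasets row. The following is a minimal sketch, assuming torchvision, of one plausible way to obtain 2048-dimensional Inception-v3 features for CIFAR-10.

```python
import torch
from torchvision import datasets
from torchvision.models import inception_v3, Inception_V3_Weights

# Load Inception-v3 pre-trained on ImageNet and strip the classification
# head so the forward pass returns the 2048-dim pooled features.
weights = Inception_V3_Weights.IMAGENET1K_V1
model = inception_v3(weights=weights)
model.fc = torch.nn.Identity()
model.eval()

# The weights ship with their own preprocessing (resize/crop to 299x299),
# which upsamples CIFAR-10's 32x32 images as the network requires.
cifar = datasets.CIFAR10(root="data", train=True, download=True,
                         transform=weights.transforms())
loader = torch.utils.data.DataLoader(cifar, batch_size=64)

feats = []
with torch.no_grad():
    for images, _ in loader:
        feats.append(model(images))
embeddings = torch.cat(feats)  # shape: (50000, 2048)
```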
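Finally, the Experiment Setup row describes a Johnson-Lindenstrauss projection followed by norm clipping. Here is a minimal NumPy sketch of that preprocessing; the function name `jl_project_and_clip` and the Gaussian projection are assumptions (any JL map would do), and the sweep mirrors the quoted hyper-parameter ranges.

```python
import numpy as np

def jl_project_and_clip(embeddings, proj_dim, clip_threshold, seed=0):
    """Project rows of `embeddings` to `proj_dim` dimensions with a random
    Gaussian JL map (preserving squared L2 distances in expectation), then
    rescale any row whose L2 norm exceeds `clip_threshold`."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(embeddings.shape[1], proj_dim)) / np.sqrt(proj_dim)
    low = embeddings @ proj
    norms = np.linalg.norm(low, axis=1, keepdims=True)
    return low * np.minimum(1.0, clip_threshold / np.maximum(norms, 1e-12))

# Sweep mirroring the quoted ranges: projection dimension in [100, 2000]
# and 10 clipping thresholds in [0.001, 1].
X = np.random.randn(1000, 6144)  # stand-in for the 6144-dim SimCLR embeddings
for proj_dim in (100, 500, 1000, 2000):
    for clip in np.geomspace(1e-3, 1.0, 10):
        X_processed = jl_project_and_clip(X, proj_dim, clip)
```

Because the projection is data-independent, applying it costs no privacy, and clipping caps each point's norm, which is what shrinks the sensitivity (the R dependency) of the downstream mechanism.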