Efficiently Computing Similarities to Private Datasets
Authors: Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our algorithms empirically exhibit improved query times and accuracy over the prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD-based approaches at comparable accuracy (see the classification sketch after the table). |
| Researcher Affiliation | Collaboration | Arturs Backurs (Microsoft Research, arturs.backurs@gmail.com); Zinan Lin (Microsoft Research, zinanlin@microsoft.com); Sepideh Mahabadi (Microsoft Research, mahabadi@ttic.edu); Sandeep Silwal (MIT, silwal@mit.edu); Jakub Tarnawski (Microsoft Research, jakub.tarnawski@microsoft.com) |
| Pseudocode | Yes | Algorithm 1: Pre-processing data structure; Algorithm 2: Interval Query; Algorithm 3: One-dimensional Distance Query; Algorithm 4: High-dimensional ℓ1 distance query |
| Open Source Code | No | The paper does not provide a link to its own open-source code, nor does it explicitly state that the code is made publicly available. |
| Open Datasets | Yes | Our dataset consists of embeddings of CIFAR-10 in dimension 2048, computed from an Inception-v3 model (Szegedy et al., 2016), pre-trained on ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper mentions "train and test splits" but does not explicitly describe a separate validation split, its size, or the splitting methodology for its own experiments. It mentions hyper-parameter tuning for the baselines but no explicit validation split of its own. |
| Hardware Specification | Yes | All experiments, unless stated otherwise, are implemented in Python 3.9 on an M1 MacBook Pro with 32GB of RAM. |
| Software Dependencies | No | The paper only mentions "Python 3.9" but does not list any specific software libraries or solvers with version numbers. |
| Experiment Setup | Yes | Our hyper-parameters: for our method, we take the embeddings from the pre-trained r152_3x_sk1 SimCLR network released by Chen et al. (2020b). Our embeddings are in dimension 6144. Since we compute the squared ℓ2 distance, we can apply a Johnson-Lindenstrauss projection to reduce the dimensionality without any privacy loss. Furthermore, we can clip the embeddings as well, which reduces the overall sensitivity of our algorithm (reducing the R dependency in Section D.1). Thus, our hyper-parameters were the projection dimension, which we varied from 100 to 2000, and the clipping threshold, which we picked from 10 choices in [0.001, 1] (a preprocessing sketch follows the table). |
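
The setup row above describes a concrete preprocessing pipeline: Johnson-Lindenstrauss projection of the 6144-dimensional SimCLR embeddings followed by norm clipping, swept over the stated hyper-parameter grid. Below is a minimal sketch of that pipeline, not the authors' released code; the names (`embeddings`, `jl_project`, `clip_rows`), the grid step size for the projection dimension, and the log-uniform spacing of the 10 clipping thresholds are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 6144                         # toy size; the SimCLR r152_3x_sk1 embeddings have d = 6144
embeddings = rng.standard_normal((n, d))  # placeholder for the real CIFAR-10 embeddings

def jl_project(X, k, rng):
    """Johnson-Lindenstrauss projection to k dimensions via a Gaussian sketch."""
    G = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ G

def clip_rows(X, c):
    """Scale each row to ell_2 norm at most c, bounding the sensitivity."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.minimum(1.0, c / np.maximum(norms, 1e-12))

# The paper's grid: projection dimension from 100 to 2000 (step assumed here)
# and 10 clipping thresholds in [0.001, 1] (log-uniform spacing assumed).
for k in range(100, 2001, 100):
    for c in np.geomspace(1e-3, 1.0, 10):
        X = clip_rows(jl_project(embeddings, k, rng), c)
        # ... build the DP similarity data structure on X and evaluate ...
```

Since squared ℓ2 distances are preserved up to (1 ± ε) by the projection and the projection matrix is data-independent, applying it before the private mechanism costs no privacy, which is why the sweep over `k` is free of any privacy accounting.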
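
The Research Type row mentions classifying by average similarity to a private dataset. The sketch below illustrates that classification rule under our own assumptions, not the paper's Algorithms 1-4: each class's clipped mean embedding is released with Gaussian noise (the noise calibration and privacy accounting are omitted), and for a linear kernel the average similarity of a query to a class's points equals the inner product with the class mean. All names (`dp_class_means`, `classify_by_average_similarity`, `clip`, `noise_std`) are hypothetical.

```python
import numpy as np

def dp_class_means(X, y, num_classes, clip, noise_std, rng):
    """Clipped per-class means released with Gaussian noise.

    Clipping each embedding to norm `clip` bounds the ell_2 sensitivity of a
    class mean by clip / (class size); calibrating `noise_std` to that
    sensitivity (not shown) yields (eps, delta)-DP via the Gaussian mechanism.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xc = X * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    means = np.stack([Xc[y == c].mean(axis=0) for c in range(num_classes)])
    return means + rng.normal(0.0, noise_std, size=means.shape)

def classify_by_average_similarity(queries, noisy_means):
    """Label each query by the class with the largest average inner-product
    similarity; for a linear kernel this is <query, class mean>."""
    return (queries @ noisy_means.T).argmax(axis=1)

# Toy usage: 500 random 64-dim "embeddings" over 10 classes.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((500, 64)), rng.integers(0, 10, 500)
means = dp_class_means(X, y, num_classes=10, clip=1.0, noise_std=0.1, rng=rng)
labels = classify_by_average_similarity(rng.standard_normal((5, 64)), means)
```

Once the noisy means are released, every query is answered with a single matrix product and no further privacy cost, which is consistent with the report's note that this approach is orders of magnitude faster than DP-SGD training.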