Efficiently Computing Similarities to Private Datasets

Authors: Arturs Backurs, Zinan Lin, Sepideh Mahabadi, Sandeep Silwal, Jakub Tarnawski

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our algorithms empirically exhibit improved query times and accuracy over the prior state of the art. We also present an application to DP classification. Our experiments demonstrate that the simple methodology of classifying based on average similarity is orders of magnitude faster than prior DP-SGD-based approaches for comparable accuracy. (A hedged sketch of such a classifier appears below the table.)
Researcher Affiliation | Collaboration | Arturs Backurs (Microsoft Research, arturs.backurs@gmail.com); Zinan Lin (Microsoft Research, zinanlin@microsoft.com); Sepideh Mahabadi (Microsoft Research, mahabadi@ttic.edu); Sandeep Silwal (MIT, silwal@mit.edu); Jakub Tarnawski (Microsoft Research, jakub.tarnawski@microsoft.com)
Pseudocode | Yes | Algorithm 1: Pre-processing data structure; Algorithm 2: Interval query; Algorithm 3: One-dimensional distance query; Algorithm 4: High-dimensional ℓ1 distance query
Open Source Code | No | The paper neither provides a link to its own open-source code nor explicitly states that the code is publicly available.
Open Datasets | Yes | Our dataset consists of embeddings of CIFAR-10 in dimension 2048, computed from an Inception-v3 model (Szegedy et al., 2016) pre-trained on ImageNet (Deng et al., 2009). (One plausible way to extract such embeddings is sketched below the table.)
Dataset Splits | No | The paper mentions "train and test splits" but does not explicitly detail a separate validation split, its size, or the splitting methodology for its own experiments. It mentions hyper-parameter tuning for the baselines but no explicit validation split of its own.
Hardware Specification | Yes | All experiments, unless stated otherwise, are implemented in Python 3.9 on an M1 MacBook Pro with 32 GB of RAM.
Software Dependencies | No | The paper mentions only "Python 3.9"; it does not list any specific software libraries or solvers with version numbers.
Experiment Setup | Yes | Our hyper-parameters: for our method, we take the embeddings from the pre-trained r152_3x_sk1 SimCLR network released by Chen et al. (2020b). Our embeddings were in dimension 6144. Since we are computing the squared ℓ2 distance, we can apply a Johnson-Lindenstrauss projection to reduce the dimensionality without any privacy loss. Furthermore, we can also clip the embeddings, which reduces the overall sensitivity of our algorithm (reducing the R dependency in Section D.1). Thus, our hyper-parameters were the projection dimension, which we swept from 100 to 2000, and the clipping threshold, which we picked from 10 choices in [0.001, 1]. (A sketch of this projection-and-clipping step appears below the table.)
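
To make the DP classification result concrete, here is a minimal Python sketch of classifying by noisy average similarity. The function name `dp_average_similarity_classify` and the crude Laplace-mechanism accounting (even budget split across classes, substitution sensitivity of a clipped mean) are illustrative assumptions, not the paper's actual mechanism or privacy analysis.

```python
import numpy as np

def dp_average_similarity_classify(query, class_data, epsilon, clip=1.0, seed=None):
    """Pick the class whose private dataset has the highest noisy average
    inner-product similarity to `query`.

    `class_data` is a list of (n_c, d) arrays of private embeddings, one per
    class. Clipping bounds each point's contribution, so changing one point
    moves a class mean by at most 2*clip/n_c; Laplace noise calibrated to that
    sensitivity, with the budget split evenly across classes, gives a simple
    epsilon-DP release of the per-class scores (crude accounting, for
    illustration only).
    """
    rng = np.random.default_rng(seed)
    eps_per_class = epsilon / len(class_data)
    scores = []
    for X in class_data:
        sims = np.clip(X @ query, -clip, clip)   # bounded per-point similarity
        sensitivity = 2.0 * clip / X.shape[0]    # effect of changing one point
        scores.append(sims.mean() + rng.laplace(scale=sensitivity / eps_per_class))
    return int(np.argmax(scores))
```

A query is answered in a single pass over each class's embeddings, with no training loop at all; that structural difference from DP-SGD, which must train a model under noisy gradients, is where the reported speed advantage comes from.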
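The paper does not spell out its embedding-extraction pipeline for the Open Datasets row. The following is a minimal sketch, assuming torchvision, of one plausible way to obtain 2048-dimensional Inception-v3 features for CIFAR-10.

```python
import torch
from torchvision import datasets
from torchvision.models import inception_v3, Inception_V3_Weights

# Load Inception-v3 pre-trained on ImageNet and strip the classification
# head so the forward pass returns the 2048-dim pooled features.
weights = Inception_V3_Weights.IMAGENET1K_V1
model = inception_v3(weights=weights)
model.fc = torch.nn.Identity()
model.eval()

# The weights ship with their own preprocessing (resize/crop to 299x299),
# which upsamples CIFAR-10's 32x32 images as the network requires.
cifar = datasets.CIFAR10(root="data", train=True, download=True,
                         transform=weights.transforms())
loader = torch.utils.data.DataLoader(cifar, batch_size=64)

feats = []
with torch.no_grad():
    for images, _ in loader:
        feats.append(model(images))
embeddings = torch.cat(feats)  # shape: (50000, 2048)
```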
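Finally, the Experiment Setup row describes a Johnson-Lindenstrauss projection followed by norm clipping. Here is a minimal NumPy sketch of that preprocessing; the function name `jl_project_and_clip` and the Gaussian projection are assumptions (any JL map would do), and the sweep mirrors the quoted hyper-parameter ranges.

```python
import numpy as np

def jl_project_and_clip(embeddings, proj_dim, clip_threshold, seed=0):
    """Project rows of `embeddings` to `proj_dim` dimensions with a random
    Gaussian JL map (preserving squared L2 distances in expectation), then
    rescale any row whose L2 norm exceeds `clip_threshold`."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(embeddings.shape[1], proj_dim)) / np.sqrt(proj_dim)
    low = embeddings @ proj
    norms = np.linalg.norm(low, axis=1, keepdims=True)
    return low * np.minimum(1.0, clip_threshold / np.maximum(norms, 1e-12))

# Sweep mirroring the quoted ranges: projection dimension in [100, 2000]
# and 10 clipping thresholds in [0.001, 1].
X = np.random.randn(1000, 6144)  # stand-in for the 6144-dim SimCLR embeddings
for proj_dim in (100, 500, 1000, 2000):
    for clip in np.geomspace(1e-3, 1.0, 10):
        X_processed = jl_project_and_clip(X, proj_dim, clip)
```

Because the projection is data-independent, applying it costs no privacy, and clipping caps each point's norm, which is what shrinks the sensitivity (the R dependency) of the downstream mechanism.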