Practical Data-Dependent Metric Compression with Provable Guarantees

Authors: Piotr Indyk, Ilya Razenshteyn, Tal Wagner

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We evaluate QuadSketch experimentally on both real and synthetic data sets: a SIFT feature data set from [7], MNIST [11], time series data reflecting taxi ridership in New York City [4], and a synthetic data set (Diagonal) containing random points from a one-dimensional subspace (i.e., a line) embedded in a high-dimensional space. The data sets are quite diverse: SIFT and MNIST are de-facto standard test cases for nearest neighbor search and distance-preserving sketches, the NYC taxi data was designed to contain anomalies and irrelevant dimensions, while Diagonal has extremely low intrinsic dimension. We compare our algorithms to Product Quantization (PQ) [7], a state-of-the-art method for computing distance-preserving sketches, as well as to a simple baseline uniform quantization method (Grid). The sketch length/accuracy tradeoffs for QS and PQ are comparable on SIFT and MNIST data, with PQ achieving higher accuracy for shorter sketches and QS achieving better accuracy for longer sketches. On the NYC taxi data, the accuracy of QS is higher over the whole range of sketch lengths. Finally, Diagonal exemplifies a situation where the low dimensionality of the data set hinders the performance of PQ, while QS naturally adapts to this data set. Overall, QS performs well on typical data sets, while its provable guarantees ensure robust performance in a wide range of scenarios. Both algorithms improve over the baseline quantization method.
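To make the Grid baseline concrete: uniform quantization simply snaps each coordinate to a fixed grid and stores the cell indices. The paper does not spell out its exact implementation, so the following is a minimal sketch under that reading; `grid_sketch` and `grid_decode` are hypothetical helper names, not code from the paper.

```python
import numpy as np

def grid_sketch(points, bits_per_dim):
    """Baseline uniform quantization ("Grid"): snap every coordinate
    to a uniform grid with 2**bits_per_dim cells per dimension."""
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    cells = 2 ** bits_per_dim
    # Avoid division by zero on constant dimensions.
    span = np.where(hi > lo, hi - lo, 1.0)
    codes = np.floor((points - lo) / span * (cells - 1) + 0.5).astype(np.int64)
    return codes, lo, span, cells

def grid_decode(codes, lo, span, cells):
    """Reconstruct approximate points from their grid cell indices."""
    return lo + codes.astype(np.float64) / (cells - 1) * span

# Toy usage: measure the coordinate-wise distortion on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))
codes, lo, span, cells = grid_sketch(X, bits_per_dim=4)
Xhat = grid_decode(codes, lo, span, cells)
print("max coordinate error:", np.abs(X - Xhat).max())
```

A finer grid (more bits per dimension) lowers distortion at the cost of sketch length, which is exactly the tradeoff the experiments above measure.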
Researcher Affiliation: Academia. Piotr Indyk (MIT), Ilya Razenshteyn (MIT), Tal Wagner (MIT).
Pseudocode: No. The paper describes the compression scheme in Section 3 through detailed prose steps but does not include formally labeled pseudocode blocks or algorithm boxes.
Open Source Code: No. The paper makes no explicit statement about releasing source code and provides no link to a code repository for the described methodology.
Open Datasets: Yes. We evaluate QuadSketch experimentally on both real and synthetic data sets: a SIFT feature data set from [7], MNIST [11], New York City taxi time series [4], and a synthetic one-dimensional data set embedded in a high-dimensional space. [...] Table 2: Datasets used in our empirical evaluation. The aspect ratio of SIFT and MNIST is estimated on a random sample.

Dataset                 Points      Dimension   Aspect ratio (Φ)
SIFT                    1,000,000   128         83.2
MNIST                   60,000      784         9.2
NYC Taxi                8,874       48          49.5
Diagonal (synthetic)    10,000      128         20,478,740.2
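The aspect ratio Φ in Table 2 is the ratio between the largest and smallest interpoint distances, which the caption says is estimated on a random sample for SIFT and MNIST. A minimal sketch of such an estimate, assuming Euclidean distance and skipping duplicate points; `estimate_aspect_ratio` is a hypothetical name, not from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def estimate_aspect_ratio(points, sample_size=1000, seed=0):
    """Estimate Phi = (max pairwise distance) / (min nonzero pairwise
    distance) of a point set, on a random sample if the set is large."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    if len(points) > sample_size:
        idx = rng.choice(len(points), size=sample_size, replace=False)
        points = points[idx]
    d = pdist(points)   # all pairwise Euclidean distances in the sample
    d = d[d > 0]        # drop zero distances from duplicate points
    return d.max() / d.min()

# Toy usage on random Gaussian data:
rng = np.random.default_rng(1)
print(estimate_aspect_ratio(rng.normal(size=(5000, 128))))
```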
Dataset Splits: No. The paper mentions using the 'standard query set' for SIFT and MNIST and '500 queries chosen at random' for Taxi and Diagonal, which implies a held-out query set for evaluation. However, it does not specify explicit training, validation, or test splits with percentages or counts, nor a cross-validation setup.
Hardware Specification: No. The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies: No. The paper does not provide specific details about ancillary software dependencies, such as library names with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup: Yes. For QuadSketch, L ranges over 2, ..., 20 and Λ ranges over 1, ..., L−1. For PQ, the number of k-means landmarks per block ranges over 2^5, 2^6, ..., 2^12. For both algorithms we include the results for all combinations of the parameters, and plot the envelope of the best performing combinations.
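One plausible reading of "the envelope of the best performing combinations" is the Pareto frontier over (sketch length, accuracy) pairs: keep only parameter settings not dominated by a shorter-or-equal sketch with at least the same accuracy. A minimal sketch of that selection under this assumption; `pareto_envelope` and the commented usage are hypothetical, not from the paper.

```python
def pareto_envelope(results):
    """Keep only Pareto-optimal (sketch_length, accuracy) pairs:
    a point survives if no other point has smaller-or-equal length
    and greater-or-equal accuracy."""
    best = []
    # Sort by length ascending; for ties, highest accuracy first,
    # so dominated same-length points fail the check below.
    for length, acc in sorted(results, key=lambda p: (p[0], -p[1])):
        if not best or acc > best[-1][1]:
            best.append((length, acc))
    return best

# Hypothetical usage over the parameter grids described above:
# results = [(sketch_bits(L, Lam), accuracy(L, Lam))
#            for L in range(2, 21) for Lam in range(1, L)]
# print(pareto_envelope(results))
```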