Practical Data-Dependent Metric Compression with Provable Guarantees

Authors: Piotr Indyk, Ilya Razenshteyn, Tal Wagner

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We evaluate QuadSketch experimentally on both real and synthetic data sets: a SIFT feature data set from [7], MNIST [11], time series data reflecting taxi ridership in New York City [4], and a synthetic data set (Diagonal) containing random points from a one-dimensional subspace (i.e., a line) embedded in a high-dimensional space. The data sets are quite diverse: SIFT and MNIST are de-facto standard test cases for nearest neighbor search and distance-preserving sketches, the NYC taxi data was designed to contain anomalies and irrelevant dimensions, while Diagonal has extremely low intrinsic dimension. We compare our algorithms to Product Quantization (PQ) [7], a state-of-the-art method for computing distance-preserving sketches, as well as to a simple baseline uniform quantization method (Grid). The sketch length/accuracy tradeoffs for QS and PQ are comparable on SIFT and MNIST data, with PQ achieving higher accuracy for shorter sketches and QS achieving better accuracy for longer sketches. On the NYC taxi data, the accuracy of QS is higher over the whole range of sketch lengths. Finally, Diagonal exemplifies a situation where the low dimensionality of the data set hinders the performance of PQ, while QS naturally adapts to this data set. Overall, QS performs well on typical data sets, while its provable guarantees ensure robust performance in a wide range of scenarios. Both algorithms improve over the baseline quantization method.
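To make the Grid baseline concrete: uniform quantization simply snaps each coordinate to a fixed grid and stores the cell indices. The paper does not spell out its exact implementation, so the following is a minimal sketch under that reading; `grid_sketch` and `grid_decode` are hypothetical helper names, not code from the paper.

```python
import numpy as np

def grid_sketch(points, bits_per_dim):
    """Baseline uniform quantization ("Grid"): snap every coordinate
    to a uniform grid with 2**bits_per_dim cells per dimension."""
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    cells = 2 ** bits_per_dim
    # Avoid division by zero on constant dimensions.
    span = np.where(hi > lo, hi - lo, 1.0)
    codes = np.floor((points - lo) / span * (cells - 1) + 0.5).astype(np.int64)
    return codes, lo, span, cells

def grid_decode(codes, lo, span, cells):
    """Reconstruct approximate points from their grid cell indices."""
    return lo + codes.astype(np.float64) / (cells - 1) * span

# Toy usage: measure the coordinate-wise distortion on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))
codes, lo, span, cells = grid_sketch(X, bits_per_dim=4)
Xhat = grid_decode(codes, lo, span, cells)
print("max coordinate error:", np.abs(X - Xhat).max())
```

A finer grid (more bits per dimension) lowers distortion at the cost of sketch length, which is exactly the tradeoff the experiments above measure.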
Researcher Affiliation: Academia. Piotr Indyk (MIT), Ilya Razenshteyn (MIT), Tal Wagner (MIT).
Pseudocode: No. The paper describes the compression scheme in Section 3 through detailed prose steps but does not include formally labeled pseudocode blocks or algorithm boxes.
Open Source Code: No. The paper makes no explicit statement about releasing source code and provides no link to a code repository for the described methodology.
Open Datasets: Yes. We evaluate QuadSketch experimentally on both real and synthetic data sets: a SIFT feature data set from [7], MNIST [11], New York City taxi time series [4], and a synthetic one-dimensional data set embedded in a high-dimensional space. [...] Table 2: Datasets used in our empirical evaluation. The aspect ratio of SIFT and MNIST is estimated on a random sample.

Dataset                 Points      Dimension   Aspect ratio (Φ)
SIFT                    1,000,000   128         83.2
MNIST                   60,000      784         9.2
NYC Taxi                8,874       48          49.5
Diagonal (synthetic)    10,000      128         20,478,740.2
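The aspect ratio Φ in Table 2 is the ratio between the largest and smallest interpoint distances, which the caption says is estimated on a random sample for SIFT and MNIST. A minimal sketch of such an estimate, assuming Euclidean distance and skipping duplicate points; `estimate_aspect_ratio` is a hypothetical name, not from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def estimate_aspect_ratio(points, sample_size=1000, seed=0):
    """Estimate Phi = (max pairwise distance) / (min nonzero pairwise
    distance) of a point set, on a random sample if the set is large."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=np.float64)
    if len(points) > sample_size:
        idx = rng.choice(len(points), size=sample_size, replace=False)
        points = points[idx]
    d = pdist(points)   # all pairwise Euclidean distances in the sample
    d = d[d > 0]        # drop zero distances from duplicate points
    return d.max() / d.min()

# Toy usage on random Gaussian data:
rng = np.random.default_rng(1)
print(estimate_aspect_ratio(rng.normal(size=(5000, 128))))
```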
Dataset Splits: No. The paper mentions using the 'standard query set' for SIFT and MNIST and '500 queries chosen at random' for Taxi and Diagonal, which implies a held-out query set for evaluation. However, it does not specify explicit training, validation, or test splits with percentages or counts, nor a cross-validation setup.
Hardware Specification: No. The paper does not provide any specific details about the hardware used to run the experiments, such as GPU models, CPU types, or cloud computing instances.
Software Dependencies: No. The paper does not provide specific details about ancillary software dependencies, such as library names with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup: Yes. For QuadSketch, L ranges over 2, ..., 20 and Λ ranges over 1, ..., L−1. For PQ, the number of k-means landmarks per block ranges over 2^5, 2^6, ..., 2^12. For both algorithms we include the results for all combinations of the parameters, and plot the envelope of the best performing combinations.
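One plausible reading of "the envelope of the best performing combinations" is the Pareto frontier over (sketch length, accuracy) pairs: keep only parameter settings not dominated by a shorter-or-equal sketch with at least the same accuracy. A minimal sketch of that selection under this assumption; `pareto_envelope` and the commented usage are hypothetical, not from the paper.

```python
def pareto_envelope(results):
    """Keep only Pareto-optimal (sketch_length, accuracy) pairs:
    a point survives if no other point has smaller-or-equal length
    and greater-or-equal accuracy."""
    best = []
    # Sort by length ascending; for ties, highest accuracy first,
    # so dominated same-length points fail the check below.
    for length, acc in sorted(results, key=lambda p: (p[0], -p[1])):
        if not best or acc > best[-1][1]:
            best.append((length, acc))
    return best

# Hypothetical usage over the parameter grids described above:
# results = [(sketch_bits(L, Lam), accuracy(L, Lam))
#            for L in range(2, 21) for Lam in range(1, L)]
# print(pareto_envelope(results))
```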