Embedding Dimension of Contrastive Learning and $k$-Nearest Neighbors

Authors: Dmitrii Avdiukhin, Vaggos Chatziafratis, Orr Fischer, Grigory Yaroslavtsev

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the embedding dimension of distance comparison data in two settings: contrastive learning and $k$-nearest neighbors ($k$-NN). Our goal is to find the smallest dimension $d$ of an $\ell_p$-space in which a given dataset can be represented. We show that the arboricity of the associated graphs plays a key role in designing embeddings. For the most popular $\ell_2$-space, we get tight bounds in both settings. In contrastive learning, we are given $m$ labeled samples $(x_i, y_i^+, z_i^-)$ representing the fact that the positive example $y_i^+$ is closer to the anchor $x_i$ than the negative example $z_i^-$ (we also give results for $t$ negatives). For representing such a dataset in $\ell_2$: $d = \Theta(\sqrt{m})$ is necessary and sufficient, consistent with our experiments. ... In $k$-NN, for each of the $n$ data points we are given an ordered set of the closest $k$ points. We show that for preserving the ordering of the $k$-NN for every point in $\ell_2$: $d = \Theta(k)$ is necessary and sufficient. ... We perform experiments on CIFAR-10 and CIFAR-100 image datasets [KH09] (we show additional experiments in Appendix A). ... We present our results in Figure 4.
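For concreteness, the $k$-NN requirement quoted above (preserving the ordering of the $k$ closest points for every data point) can be checked directly on a candidate embedding. The following sketch is ours, not the paper's; it assumes a NumPy array of embeddings, ignores distance ties, and uses hypothetical names throughout.

```python
import numpy as np

def preserves_knn_order(X, knn_lists):
    """Check whether the embedding X preserves the given k-NN orderings.

    X         : (n, d) array, one d-dimensional point per row
    knn_lists : knn_lists[i] is the ordered list of the k nearest neighbors
                of point i (closest first) that must be preserved
    Returns True iff, for every i, the listed neighbors appear in the same
    order of increasing distance to X[i] and precede all non-listed points.
    """
    for i, neighbors in enumerate(knn_lists):
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf                         # ignore the point itself
        order = np.argsort(dists, kind="stable")  # indices sorted by distance
        if list(order[:len(neighbors)]) != list(neighbors):
            return False
    return True
```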
Researcher Affiliation | Academia | Dmitrii Avdiukhin, Computer Science Department, Northwestern University, Evanston, IL 60657, USA, dmitrii.avdiukhin@northwestern.edu; Vaggos Chatziafratis, Computer Science and Engineering Department, University of California at Santa Cruz, Santa Cruz, CA 95064, USA, vaggos@ucsc.edu; Orr Fischer, Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel, fischeo@biu.ac.il; Grigory Yaroslavtsev, Computer Science Department, George Mason University, Fairfax, VA 22030, USA, grigory@gmu.edu
Pseudocode | No | The paper describes mathematical constructions and proofs (e.g., Construction of $\hat{x}_i$, Construction of $x_i$) but does not include formal pseudocode blocks or algorithms labeled as such.
Open Source Code | Yes | Justification: code added to supplementary material.
Open Datasets | Yes | We perform experiments on CIFAR-10 and CIFAR-100 image datasets [KH09] (we show additional experiments in Appendix A).
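Both datasets are publicly available and can be obtained, for example, through torchvision's standard loaders. The transform and normalization constants below are assumptions for illustration and are not stated in the paper.

```python
import torchvision
import torchvision.transforms as T

# Standard torchvision loaders; the normalization constants are the usual
# CIFAR-10 statistics and are an assumption, not taken from the paper.
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

cifar10_train = torchvision.datasets.CIFAR10(root="data", train=True,
                                             download=True, transform=transform)
cifar100_train = torchvision.datasets.CIFAR100(root="data", train=True,
                                               download=True, transform=transform)
```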
Dataset Splits | No | The paper mentions training on CIFAR-10 and CIFAR-100 and evaluating accuracy, but it does not provide specific percentages or counts for training, validation, or test splits. It states: 'Let Q be contrastive triplets sampled uniformly at random from all possible triplets of images, labeled based on the ground-truth distance.'
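The quoted sampling procedure can be sketched as follows. The helper `ground_truth_dist` is hypothetical and stands in for whatever reference distance the paper labels triplets with; the sketch itself is not taken from the released code.

```python
import random

def sample_contrastive_triplets(ground_truth_dist, n, m, seed=0):
    """Sample m triplets (anchor, positive, negative) uniformly at random from
    all triples of distinct indices in range(n), labeling the candidate that is
    closer to the anchor under the ground-truth distance as the positive example.

    ground_truth_dist : callable (i, j) -> float, the reference distance
    """
    rng = random.Random(seed)
    triplets = []
    while len(triplets) < m:
        i, j, k = rng.sample(range(n), 3)
        if ground_truth_dist(i, j) <= ground_truth_dist(i, k):
            triplets.append((i, j, k))   # j is the positive, k the negative
        else:
            triplets.append((i, k, j))
    return triplets
```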
Hardware Specification | Yes | We train the model for 50 epochs on a single NVIDIA A100 GPU using triplet loss [SKP15b]: $L_F(x, y^+, z^-) = \|F(x) - F(y^+)\|^2 - \|F(x) - F(z^-)\|^2 + 1$.
Software Dependencies | No | The paper mentions using a 'pretrained ResNet-18 neural network' and 'triplet loss', but it does not specify any software names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, scikit-learn 0.y).
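Since only a 'pretrained ResNet-18' is named, one plausible PyTorch construction of the embedding model is sketched below; the linear projection head and the output dimension $d$ are assumptions rather than details from the paper.

```python
import torch.nn as nn
import torchvision.models as models

def make_embedding_model(d: int) -> nn.Module:
    """Pretrained ResNet-18 backbone with a linear head mapping into R^d.
    Using a single linear projection layer is an assumption; the paper only
    states that a pretrained ResNet-18 is used."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = nn.Linear(backbone.fc.in_features, d)  # replace classifier head
    return backbone
```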
Experiment Setup | Yes | We train the model for 50 epochs on a single NVIDIA A100 GPU using triplet loss [SKP15b]: $L_F(x, y^+, z^-) = \|F(x) - F(y^+)\|^2 - \|F(x) - F(z^-)\|^2 + 1$. Since our goal is to find an embedding of this set of queries, we evaluate the accuracy as the fraction of satisfied contrastive samples.
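As a reference for the quoted setup, the stated loss and the satisfied-fraction accuracy can be written in PyTorch roughly as follows. This is a sketch under the assumption of squared $\ell_2$ distances with margin 1, in the spirit of [SKP15b]; it is not taken from the authors' supplementary code.

```python
import torch

def triplet_loss(F_x, F_y, F_z, margin: float = 1.0):
    """L_F(x, y+, z-) = ||F(x) - F(y+)||^2 - ||F(x) - F(z-)||^2 + margin,
    averaged over the batch, exactly as written in the quoted formula
    (many implementations additionally clamp each term at zero)."""
    d_pos = (F_x - F_y).pow(2).sum(dim=1)
    d_neg = (F_x - F_z).pow(2).sum(dim=1)
    return (d_pos - d_neg + margin).mean()

@torch.no_grad()
def contrastive_accuracy(F_x, F_y, F_z):
    """Fraction of contrastive samples satisfied by the embedding, i.e. for
    which the positive is strictly closer to the anchor than the negative."""
    d_pos = (F_x - F_y).pow(2).sum(dim=1)
    d_neg = (F_x - F_z).pow(2).sum(dim=1)
    return (d_pos < d_neg).float().mean().item()
```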