Matryoshka Representation Learning

Authors: Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, Ali Farhadi

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offers: (a) up to 14× smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14× real-world speed-ups for large-scale retrieval on ImageNet-1K and ImageNet-4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities: vision (ViT, ResNet), vision + language (ALIGN), and language (BERT).
Researcher Affiliation | Collaboration | University of Washington, Google Research, Harvard University; {kusupati,ali}@cs.washington.edu, prajain@google.com
Pseudocode | Yes | Refer to Alg. 1 and Alg. 2 in Appendix A for the building blocks of Matryoshka Representation Learning (MRL). (A minimal sketch of these building blocks appears after this table.)
Open Source Code | Yes | MRL code and pretrained models are open-sourced at https://github.com/RAIVNLab/MRL.
Open Datasets | Yes | We adapt Matryoshka Representation Learning (MRL) to various representation learning setups: (a) supervised learning for vision: ResNet50 [27] on ImageNet-1K [71] and ViT-B/16 [22] on JFT-300M [80]; (b) contrastive learning for vision + language: ALIGN model with ViT-B/16 vision encoder and BERT language encoder on ALIGN data [44]; and (c) masked language modelling: BERT [19] on English Wikipedia and BooksCorpus [97].
Dataset Splits | Yes | ImageNet-1K (train set with 1.3M samples as the database and validation set with 50K samples as the queries). ... We learn thresholds on the maximum softmax probability [31] for each nested classifier on a holdout validation set. (A sketch of this threshold cascade appears after this table.)
Hardware Specification | No | The provided text of the paper does not explicitly state the specific hardware used (e.g., GPU models, CPU types) for running the experiments. It only indicates that such details are in Appendix C and I, which are not provided.
Software Dependencies | Yes | ffcv: https://github.com/libffcv/ffcv/, 2022, commit 607d117.
Experiment Setup | Yes | We use M = {8, 16, 32, 64, 128, 256, 512, 1024, 2048} and M = {12, 24, 48, 96, 192, 384, 768} as the explicitly optimized nested dimensions respectively. ... For a given query image, we obtained a shortlist, K = 200, of images from the database using a lower-dimensional representation, e.g. Ds = 16, followed by reranking with a higher-capacity representation, e.g. Dr = 2048. (A sketch of this shortlist-and-rerank funnel appears after this table.)
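For concreteness, here is a minimal PyTorch sketch of the two building blocks the Pseudocode row points to: per-prefix linear classifier heads over nested slices of one embedding, and a summed cross-entropy loss over the nested logits. The class and variable names (`MatryoshkaLinear`, `MatryoshkaCELoss`, `nesting_dims`) and the uniform loss weights are our assumptions, not the reference implementation; consult Alg. 1 and Alg. 2 in Appendix A and the open-sourced repository for the authors' exact code.

```python
import torch
import torch.nn as nn

class MatryoshkaLinear(nn.Module):
    """One linear classifier head per nested prefix of the embedding."""
    def __init__(self, nesting_dims, num_classes):
        super().__init__()
        self.nesting_dims = nesting_dims
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in nesting_dims])

    def forward(self, z):
        # z: (batch, D) full embedding; each head sees only the first d dims
        return [head(z[:, :d]) for head, d in zip(self.heads, self.nesting_dims)]

class MatryoshkaCELoss(nn.Module):
    """Weighted sum of cross-entropy losses over all nested logits."""
    def __init__(self, weights=None):
        super().__init__()
        self.weights = weights  # None -> uniform weights (our assumption)
        self.ce = nn.CrossEntropyLoss()

    def forward(self, nested_logits, target):
        w = self.weights or [1.0] * len(nested_logits)
        return sum(wi * self.ce(logits, target) for wi, logits in zip(w, nested_logits))

# Usage: nested dims for ResNet50 on ImageNet-1K, per the Experiment Setup row
nesting_dims = [8, 16, 32, 64, 128, 256, 512, 1024, 2048]
head = MatryoshkaLinear(nesting_dims, num_classes=1000)
loss_fn = MatryoshkaCELoss()
z = torch.randn(4, 2048)           # stand-in backbone embeddings
y = torch.randint(0, 1000, (4,))   # stand-in labels
loss = loss_fn(head(z), y)
```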
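The Dataset Splits row mentions learning per-classifier thresholds on the maximum softmax probability for adaptive classification. Below is a hedged sketch of how such a cascade could work: start with the smallest nested classifier and escalate to larger prefixes until one clears its confidence threshold. The function name `cascade_predict` and the threshold values in the usage line are hypothetical; the paper learns its thresholds on a holdout validation set.

```python
import torch
import torch.nn.functional as F

def cascade_predict(nested_logits, thresholds):
    """Accept the smallest nested classifier that is confident enough.

    nested_logits: list of (batch, num_classes) tensors, smallest prefix first.
    thresholds: one max-softmax-probability cutoff per head.
    Returns (predictions, index of the head that decided each example).
    """
    batch = nested_logits[0].shape[0]
    preds = torch.zeros(batch, dtype=torch.long)
    used = torch.zeros(batch, dtype=torch.long)
    undecided = torch.ones(batch, dtype=torch.bool)
    for i, logits in enumerate(nested_logits):
        conf, pred = F.softmax(logits, dim=-1).max(dim=-1)
        confident = conf >= thresholds[i]
        if i == len(nested_logits) - 1:
            confident[:] = True  # the largest head decides all leftovers
        accept = undecided & confident
        preds[accept] = pred[accept]
        used[accept] = i
        undecided &= ~accept
    return preds, used

# Usage with three hypothetical heads over a batch of 5 examples
logits = [torch.randn(5, 1000) for _ in range(3)]
preds, used = cascade_predict(logits, thresholds=[0.9, 0.8, 0.7])
```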
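The Experiment Setup row describes adaptive retrieval: shortlist with a low-dimensional prefix (Ds = 16), then rerank the K = 200 candidates with the full representation (Dr = 2048). Here is a minimal NumPy sketch of that funnel, assuming L2-normalized embeddings and brute-force search rather than the ANN index behind the paper's wall-clock speed-ups; `shortlist_and_rerank` and the random stand-in data are ours.

```python
import numpy as np

def shortlist_and_rerank(query, database, ds=16, k=200):
    """Funnel retrieval: shortlist with the first `ds` dims, rerank with all.

    Rows of `database` and `query` are assumed L2-normalized, so dot
    products are cosine similarities; prefixes are renormalized before
    the coarse pass.
    """
    # Stage 1: coarse scores from the renormalized low-dimensional prefix
    db_prefix = database[:, :ds]
    db_prefix = db_prefix / np.linalg.norm(db_prefix, axis=1, keepdims=True)
    q_prefix = query[:ds] / np.linalg.norm(query[:ds])
    shortlist = np.argpartition(-(db_prefix @ q_prefix), k)[:k]
    # Stage 2: rerank the K candidates with the full-capacity representation
    fine = database[shortlist] @ query
    return shortlist[np.argsort(-fine)]

# Usage with random stand-in data (the real database is ~1.3M x 2048)
rng = np.random.default_rng(0)
db = rng.standard_normal((10000, 2048)).astype(np.float32)
db /= np.linalg.norm(db, axis=1, keepdims=True)
q = rng.standard_normal(2048).astype(np.float32)
q /= np.linalg.norm(q)
top10 = shortlist_and_rerank(q, db)[:10]
```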