Hyperbolic Image-Text Representations

Authors: Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Shanmukha Ramakrishna Vedantam

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our main objective in the experiments is to establish the competitiveness of the hyperbolic representations of MERU as compared to Euclidean representations obtained from CLIP-style models.
Researcher Affiliation | Collaboration | ¹University of Michigan, ²Meta AI, ³Independent Researcher, ⁴New York University.
Pseudocode | No | The paper provides mathematical equations and descriptions of methods, but no structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not explicitly state that the source code for MERU will be made available, nor does it provide a link to a repository.
Open Datasets | Yes | We develop our CLIP baseline and train it using a single public dataset, RedCaps (Desai et al., 2021), for easier reproducibility. We evaluate the retrieval capabilities of MERU as compared to CLIP on two established benchmarks: COCO and Flickr30K (Chen et al., 2015; Young et al., 2014).
Dataset Splits | Yes | COCO evaluation uses the val2017 split... We set λ = 0.2 by running a hyperparameter sweep with ViT-B/16 models for one epoch. We search the regularization cost per dataset, C ∈ [10^-6, 10^6], performing a two-step search on the val split like Radford et al. (2021). We then train a final classifier on the combined train and val splits for a maximum of 1000 iterations, and report top-1 mean per-class accuracy on the test split (see the linear-probe sketch below the table). Table 6: Datasets used for image classification evaluation (lists Train, Val, and Test splits).
Hardware Specification | Yes | Our smallest model trains using 8 V100 GPUs in less than one day and significantly outperforms recent CLIP re-implementations that use YFCC (Mu et al., 2022). This CLIP model requires 16 V100 32GB GPUs with a batch size of 4096 and automatic mixed precision (Micikevicius et al., 2018).
Software Dependencies | No | The paper mentions software like PyTorch (Paszke et al., 2019), timm (Wightman, 2019), scikit-learn (Pedregosa et al., 2011), and spaCy (Honnibal et al., 2020), but it does not specify version numbers for these libraries.
Experiment Setup | Yes | We use the Vision Transformer (Dosovitskiy et al., 2021) as the image encoder, considering three models of varying capacity: ViT-S (Chen et al., 2021; Touvron et al., 2021), ViT-B, and ViT-L. All use a patch size of 16. The text encoder is the same as CLIP's: a 12-layer, 512-dimension-wide Transformer (Vaswani et al., 2017) language model... We randomly crop 50-100% of the image area and resize to 224×224... We initialize the softmax temperature as τ = 0.07 and clamp it to a minimum value of 0.01. For MERU, we initialize the learnable projection scalars α_img = α_txt = 1/√512 and the curvature parameter c = 1.0, clamping it in [0.1, 10.0]... We use AdamW (Loshchilov & Hutter, 2019) with weight decay 0.2 and (β1, β2) = (0.9, 0.98). All models are trained for 120K iterations with batch size 2048 (~20 epochs). The maximum learning rate is 5×10^-4, increased linearly for the first 4K iterations, followed by cosine decay to zero (Loshchilov & Hutter, 2016). (See the training-configuration sketch below the table.)
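
The quoted Experiment Setup maps onto a fairly standard training configuration. Below is a minimal PyTorch sketch of the optimizer, learning-rate schedule, and per-step clamping of the learnable scalars, assuming the model exposes the temperature and curvature as parameters; the attribute and function names are illustrative and not taken from the MERU codebase.

```python
# Minimal sketch of the training configuration quoted above: AdamW with weight
# decay 0.2 and betas (0.9, 0.98), 120K steps at batch size 2048, linear warmup
# for 4K steps, then cosine decay to zero. Attribute names are assumptions.
import math
import torch

TOTAL_ITERS = 120_000   # ~20 epochs at batch size 2048
WARMUP_ITERS = 4_000
MAX_LR = 5e-4


def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    return torch.optim.AdamW(
        model.parameters(), lr=MAX_LR, betas=(0.9, 0.98), weight_decay=0.2
    )


def lr_at(step: int) -> float:
    """Linear warmup for the first 4K iterations, then cosine decay to zero."""
    if step < WARMUP_ITERS:
        return MAX_LR * step / WARMUP_ITERS
    progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))


def clamp_scalars(model: torch.nn.Module) -> None:
    """Keep the learnable scalars in the ranges stated in the paper.

    `model.temperature` and `model.curvature` are assumed nn.Parameter
    attributes; the actual code may parameterize these differently.
    """
    with torch.no_grad():
        model.temperature.clamp_(min=0.01)         # initialized at 0.07
        model.curvature.clamp_(min=0.1, max=10.0)  # initialized at 1.0
```

In a training loop, one would set each `param_group["lr"]` to `lr_at(step)` and call `clamp_scalars(model)` after every optimizer step.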
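
Likewise, the linear-probe protocol quoted in the Dataset Splits row can be sketched with scikit-learn's LogisticRegression. The paper only specifies the overall search range, a two-step search on the val split, the 1000-iteration cap for the final classifier, and the mean per-class accuracy metric; the refinement grid, the iteration cap on the sweep classifiers, and all function names below are assumptions.

```python
# Hypothetical sketch of the linear-probe evaluation: two-step search for the
# regularization cost C on the val split, a final classifier on train + val,
# and top-1 mean per-class accuracy on the test split.
import numpy as np
from sklearn.linear_model import LogisticRegression


def _best_cost(train_x, train_y, val_x, val_y, costs):
    """Return the cost C with the highest accuracy on the val split."""
    scores = []
    for c in costs:
        clf = LogisticRegression(C=float(c), max_iter=1000)
        clf.fit(train_x, train_y)
        scores.append(clf.score(val_x, val_y))
    return float(costs[int(np.argmax(scores))])


def linear_probe(train_x, train_y, val_x, val_y, test_x, test_y):
    # Step 1: coarse sweep over C in [1e-6, 1e6].
    coarse = 10.0 ** np.arange(-6, 7, dtype=float)
    c = _best_cost(train_x, train_y, val_x, val_y, coarse)

    # Step 2: finer sweep around the best coarse value (assumed refinement).
    fine = c * 10.0 ** np.linspace(-1.0, 1.0, 9)
    c = _best_cost(train_x, train_y, val_x, val_y, fine)

    # Final classifier on combined train + val, capped at 1000 iterations.
    final = LogisticRegression(C=c, max_iter=1000)
    final.fit(np.concatenate([train_x, val_x]), np.concatenate([train_y, val_y]))

    # Top-1 mean per-class accuracy on the test split.
    pred = final.predict(test_x)
    per_class = [np.mean(pred[test_y == k] == k) for k in np.unique(test_y)]
    return float(np.mean(per_class))
```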