Hyperbolic Image-Text Representations
Authors: Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, Shanmukha Ramakrishna Vedantam
ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that MERU learns a highly interpretable and structured representation space while being competitive with CLIP's performance on standard multi-modal tasks like image classification and image-text retrieval. Our main objective in the experiments is to establish the competitiveness of hyperbolic representations of MERU as compared to Euclidean representations obtained from CLIP-style models. |
| Researcher Affiliation | Collaboration | ¹University of Michigan, ²Meta AI, ³Independent Researcher, ⁴New York University. |
| Pseudocode | No | The paper provides mathematical equations and descriptions of methods but no structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for MERU will be made available or provide a link to a repository. |
| Open Datasets | Yes | We develop our CLIP baseline and train it using a single public dataset, RedCaps (Desai et al., 2021), for easier reproducibility. We evaluate the retrieval capabilities of MERU as compared to CLIP on two established benchmarks: COCO and Flickr30K (Chen et al., 2015; Young et al., 2014). |
| Dataset Splits | Yes | COCO evaluation uses the val2017 split... We set λ = 0.2 by running a hyperparameter sweep with ViT-B/16 models for one epoch. We search the regularization cost per dataset, C ∈ [10⁻⁶, 10⁶], performing a two-step search on the val split like Radford et al. (2021). Then we train a final classifier on the combined train and val splits for a maximum of 1000 iterations, then report top-1 mean per-class accuracy on the test split. Table 6: Datasets used for image classification evaluation (lists Train, Val, Test splits). (A hedged sketch of this two-step search appears after the table.) |
| Hardware Specification | Yes | Our smallest model trains using 8 V100 GPUs in less than one day and significantly outperforms recent CLIP re-implementations that use YFCC (Mu et al., 2022). This CLIP model requires 16 V100 32GB GPUs with a batch size of 4096 and automatic mixed precision (Micikevicius et al., 2018). |
| Software Dependencies | No | The paper mentions software like PyTorch (Paszke et al., 2019), timm (Wightman, 2019), scikit-learn (Pedregosa et al., 2011), and spaCy (Honnibal et al., 2020), but it does not specify version numbers for these libraries. |
| Experiment Setup | Yes | We use the Vision Transformer (Dosovitskiy et al., 2021) as image encoder, considering three models of varying capacity: ViT-S (Chen et al., 2021; Touvron et al., 2021), ViT-B, and ViT-L. All use a patch size of 16. The text encoder is the same as CLIP's: a 12-layer, 512-dimension-wide Transformer (Vaswani et al., 2017) language model... We randomly crop 50-100% of the image area and resize to 224×224... We initialize the softmax temperature as τ = 0.07 and clamp it to a minimum value of 0.01. For MERU, we initialize the learnable projection scalars α_img = α_txt = 1/√512, the curvature parameter c = 1.0 and clamp it in [0.1, 10.0]... We use AdamW (Loshchilov & Hutter, 2019) with weight decay 0.2 and (β1, β2) = (0.9, 0.98). All models are trained for 120K iterations with batch size 2048 (≈20 epochs). The maximum learning rate is 5×10⁻⁴, increased linearly for the first 4K iterations, followed by cosine decay to zero (Loshchilov & Hutter, 2016). (A hedged sketch of this optimizer and schedule setup follows the table.) |
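
The optimizer, learning-rate schedule, and clamped scalar parameters quoted in the Experiment Setup row can be summarized in a short PyTorch sketch. This is a minimal illustration of the reported settings, not the authors' released code; the names `TrainableScalars` and `make_optimizer_and_scheduler` are illustrative assumptions.

```python
# Minimal PyTorch sketch of the reported training setup; illustrative only,
# not the authors' released implementation.
import math
import torch

TOTAL_ITERS = 120_000   # 120K iterations at batch size 2048 (~20 epochs)
WARMUP_ITERS = 4_000    # linear warmup length
MAX_LR = 5e-4           # peak learning rate

class TrainableScalars(torch.nn.Module):
    """Learnable scalars with the reported initial values and clamp ranges."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.temperature = torch.nn.Parameter(torch.tensor(0.07))             # softmax temperature tau
        self.curvature = torch.nn.Parameter(torch.tensor(1.0))                # MERU curvature c
        self.alpha_img = torch.nn.Parameter(torch.tensor(embed_dim ** -0.5))  # 1/sqrt(512)
        self.alpha_txt = torch.nn.Parameter(torch.tensor(embed_dim ** -0.5))

    @torch.no_grad()
    def clamp_(self):
        # Apply the reported clamps (tau >= 0.01, c in [0.1, 10.0]).
        self.temperature.clamp_(min=0.01)
        self.curvature.clamp_(0.1, 10.0)

def make_optimizer_and_scheduler(model: torch.nn.Module):
    # AdamW with weight decay 0.2 and betas (0.9, 0.98), as quoted above.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=MAX_LR, weight_decay=0.2, betas=(0.9, 0.98)
    )

    def lr_lambda(step: int) -> float:
        if step < WARMUP_ITERS:                       # linear warmup for 4K steps
            return step / WARMUP_ITERS
        progress = (step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay to zero

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In a training loop built this way, `clamp_()` would be called after each `optimizer.step()`, mirroring the clamping of the temperature and curvature described in the quote.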
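The two-step regularization search described in the Dataset Splits row can likewise be sketched with scikit-learn's logistic regression (the paper cites scikit-learn). The grid spacing and helper names below are assumptions for illustration; only the search range C ∈ [10⁻⁶, 10⁶], the 1000-iteration cap, and the top-1 mean per-class accuracy metric come from the quoted text.

```python
# Hedged sketch of the two-step C search and final linear probe; grid sizes
# and helper names are illustrative assumptions, not the paper's exact protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def two_step_c_search(train_x, train_y, val_x, val_y):
    # Step 1: coarse sweep over C in [1e-6, 1e6], scored on the val split.
    scores = {}
    for c in np.logspace(-6, 6, num=7):
        clf = LogisticRegression(C=c, max_iter=1000).fit(train_x, train_y)
        scores[c] = clf.score(val_x, val_y)
    best = max(scores, key=scores.get)

    # Step 2: finer sweep around the best coarse value.
    for c in best * np.logspace(-1, 1, num=5):
        clf = LogisticRegression(C=c, max_iter=1000).fit(train_x, train_y)
        scores[c] = clf.score(val_x, val_y)
    return max(scores, key=scores.get)

def probe_test_accuracy(best_c, trainval_x, trainval_y, test_x, test_y):
    # Final classifier on the combined train+val split (max 1000 iterations),
    # then top-1 mean per-class accuracy on the test split.
    clf = LogisticRegression(C=best_c, max_iter=1000).fit(trainval_x, trainval_y)
    return balanced_accuracy_score(test_y, clf.predict(test_x))
```

Here `balanced_accuracy_score` computes mean per-class recall, which matches the "top-1 mean per-class accuracy" reported in the quote.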