Unified Lexical Representation for Interpretable Visual-Language Alignment
Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA. |
| Researcher Affiliation | Collaboration | Fudan University; Amazon Web Services. yifanli23@m.fudan.edu.cn, yi-kai.wang@outlook.com, yanweifu@fudan.edu.cn, {rudongyu, zhaz, htong}@amazon.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/Clementine24/LexVLA. |
| Open Datasets | Yes | Datasets We use CC-12M [4] for training, a dataset consisting of 12.4 million image-text pairs. We successfully download 9.2M pairs and use this subset as our training set. For evaluation, we use Flickr30k [33] and MSCOCO [23] to evaluate zero-shot cross-modal retrieval tasks. |
| Dataset Splits | Yes | We conduct experiments on zero-shot cross-modal retrieval tasks on Flickr30k and MSCOCO based on the splits in [16] following previous approaches. |
| Hardware Specification | Yes | We use 8 A100 GPUs of 40GB memory to train Lex VLA. |
| Software Dependencies | No | The paper mentions `DINOv2 [30]` and `Llama 2 [40] 7B model` as backbones, and `Adam optimizer [17]`. While the backbones are specific models, explicit version numbers for general software libraries or frameworks like PyTorch, TensorFlow, or specific Python versions are not provided. Adam optimizer parameters are given, but not its software version. |
| Experiment Setup | Yes | We use the Adam optimizer [17] with learning rate 5e-4 and cosine decay, batch size of 6,144, and BFloat16 precision for 12 epochs. We initialize τ as 0.07 and clip logits larger than 100, as in [34]. We quadratically warm up λ in the first 2k steps and then freeze it, as in [31]. We set λI as 5e-4 and λT as 1e-3. |
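The reported schedules can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names (`cosine_lr`, `warmed_lambda`, `clipped_logit`) are hypothetical, and the exact shapes of the cosine decay and quadratic warmup are assumptions consistent with the paper's description (lr 5e-4 with cosine decay, λ warmed up quadratically over the first 2k steps then frozen, τ initialized to 0.07, logits clipped at 100).

```python
import math

# Hyperparameters as reported in the paper.
BASE_LR = 5e-4        # Adam learning rate, cosine-decayed
TAU_INIT = 0.07       # initial softmax temperature tau
LOGIT_CLIP = 100.0    # clip logits larger than 100, as in [34]
WARMUP_STEPS = 2_000  # quadratic warmup horizon for lambda
LAMBDA_I = 5e-4       # image-branch lambda
LAMBDA_T = 1e-3       # text-branch lambda


def cosine_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Cosine-decayed learning rate (assumed schedule shape)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))


def warmed_lambda(step: int, lam_max: float) -> float:
    """Quadratic warmup of lambda over the first 2k steps, then frozen."""
    if step >= WARMUP_STEPS:
        return lam_max
    return lam_max * (step / WARMUP_STEPS) ** 2


def clipped_logit(similarity: float, tau: float = TAU_INIT) -> float:
    """Temperature-scaled similarity with the logit cap at 100."""
    return min(similarity / tau, LOGIT_CLIP)
```

For example, `warmed_lambda(1000, LAMBDA_T)` returns a quarter of the final text-branch weight, and any similarity whose scaled value exceeds 100 is capped by `clipped_logit`.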