Unified Lexical Representation for Interpretable Visual-Language Alignment

Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
Researcher Affiliation | Collaboration | 1 Fudan University, 2 Amazon Web Services; yifanli23@m.fudan.edu.cn, yi-kai.wang@outlook.com, yanweifu@fudan.edu.cn, {rudongyu, zhaz, htong}@amazon.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/Clementine24/LexVLA.
Open Datasets | Yes | We use CC-12M [4] for training, a dataset consisting of 12.4 million image-text pairs. We successfully download 9.2M pairs and use this subset as our training set. For evaluation, we use Flickr30k [33] and MSCOCO [23] to evaluate zero-shot cross-modal retrieval tasks.
Dataset Splits | Yes | We conduct experiments on zero-shot cross-modal retrieval tasks on Flickr30k and MSCOCO based on the splits in [16], following previous approaches. (See the retrieval-evaluation sketch below the table.)
Hardware Specification | Yes | We use 8 A100 GPUs with 40GB memory to train LexVLA.
Software Dependencies | No | The paper mentions DINOv2 [30] and the Llama 2 [40] 7B model as backbones, and the Adam optimizer [17]. While the backbones are specific models, explicit version numbers for general software libraries or frameworks such as PyTorch, TensorFlow, or Python are not provided; Adam hyperparameters are given, but not the library version. (See the backbone-loading sketch below the table.)
Experiment Setup | Yes | We use the Adam optimizer [17] with a learning rate of 5e-4 and cosine decay, a batch size of 6,144, and BFloat16 precision for 12 epochs. We initialize τ as 0.07 and clip logits larger than 100 as in [34]. We quadratically warm up λ over the first 2k steps and then freeze it, as in [31]. We set λ_I to 5e-4 and λ_T to 1e-3. (See the training-schedule sketch below the table.)
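
The backbones named in the Software Dependencies row are publicly released models. Below is a minimal sketch of how they might be loaded; the specific DINOv2 variant and the Hugging Face model ID are assumptions, since the paper only names "DINOv2" and "Llama 2 7B".

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Visual backbone: DINOv2 via torch.hub (the ViT-L/14 variant is an assumption).
visual_backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Textual backbone: Llama 2 7B (gated on Hugging Face; license acceptance required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
text_backbone = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)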
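
The Experiment Setup row fully specifies the optimization schedule. The sketch below restates it in PyTorch-style code; it is not the authors' implementation, and the total step count is only an estimate derived from the 9.2M-pair subset, the batch size of 6,144, and 12 epochs.

import math
import torch

LR = 5e-4                        # peak learning rate, cosine-decayed
TOTAL_STEPS = 18_000             # estimate: 9.2M pairs / 6,144 per batch * 12 epochs
WARMUP_STEPS = 2_000             # quadratic warmup of lambda
LAMBDA_I, LAMBDA_T = 5e-4, 1e-3  # regularization weights for the image / text branches

# Learnable temperature, initialized so that exp(log_scale) = 1 / 0.07.
log_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def lr_at(step: int) -> float:
    # Cosine decay of the learning rate over training (no LR warmup is mentioned in the paper).
    return LR * 0.5 * (1 + math.cos(math.pi * step / TOTAL_STEPS))

def lambda_at(step: int, final_value: float) -> float:
    # Quadratic warmup of lambda over the first 2k steps, then frozen at its final value.
    if step < WARMUP_STEPS:
        return final_value * (step / WARMUP_STEPS) ** 2
    return final_value

def logit_scale() -> torch.Tensor:
    # Exponentiated temperature, clipped at 100 as in CLIP [34].
    return log_scale.exp().clamp(max=100.0)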
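
For context, zero-shot cross-modal retrieval on the Flickr30k and MSCOCO splits is conventionally scored with Recall@K over a query-gallery similarity matrix. The sketch below is a generic version of that metric, not the authors' evaluation code; queries with multiple correct matches (e.g., the five captions per image) are handled through ground-truth index sets.

import numpy as np

def recall_at_k(sim: np.ndarray, gt: list[set[int]], k: int) -> float:
    # sim[i, j]: similarity between query i and gallery item j.
    # gt[i]: set of gallery indices that are correct matches for query i.
    topk = np.argsort(-sim, axis=1)[:, :k]  # k most similar gallery items per query
    hits = [bool(gt[i] & set(topk[i].tolist())) for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Example: 2 queries over a gallery of 4 items, Recall@1.
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.3, 0.2, 0.8, 0.1]])
print(recall_at_k(sim, [{0}, {2, 3}], k=1))  # -> 1.0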