Unified Lexical Representation for Interpretable Visual-Language Alignment

Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA.
Researcher Affiliation | Collaboration | 1 Fudan University, 2 Amazon Web Services; yifanli23@m.fudan.edu.cn, yi-kai.wang@outlook.com, yanweifu@fudan.edu.cn, {rudongyu, zhaz, htong}@amazon.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/Clementine24/LexVLA.
Open Datasets | Yes | We use CC-12M [4] for training, a dataset consisting of 12.4 million image-text pairs. We successfully download 9.2M pairs and use this subset as our training set. For evaluation, we use Flickr30k [33] and MSCOCO [23] to evaluate zero-shot cross-modal retrieval tasks.
Dataset Splits | Yes | We conduct experiments on zero-shot cross-modal retrieval tasks on Flickr30k and MSCOCO based on the splits in [16], following previous approaches. (See the retrieval-evaluation sketch below the table.)
Hardware Specification | Yes | We use 8 A100 GPUs with 40GB memory to train LexVLA.
Software Dependencies | No | The paper mentions DINOv2 [30] and the Llama 2 [40] 7B model as backbones, and the Adam optimizer [17]. While the backbones are specific models, explicit version numbers for general software libraries or frameworks such as PyTorch, TensorFlow, or Python are not provided; Adam hyperparameters are given, but not the library version. (See the backbone-loading sketch below the table.)
Experiment Setup | Yes | We use the Adam optimizer [17] with a learning rate of 5e-4 and cosine decay, a batch size of 6,144, and BFloat16 precision for 12 epochs. We initialize τ as 0.07 and clip logits larger than 100 as in [34]. We quadratically warm up λ over the first 2k steps and then freeze it, as in [31]. We set λ_I to 5e-4 and λ_T to 1e-3. (See the training-schedule sketch below the table.)
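
The backbones named in the Software Dependencies row are publicly released models. Below is a minimal sketch of how they might be loaded; the specific DINOv2 variant and the Hugging Face model ID are assumptions, since the paper only names "DINOv2" and "Llama 2 7B".

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Visual backbone: DINOv2 via torch.hub (the ViT-L/14 variant is an assumption).
visual_backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")

# Textual backbone: Llama 2 7B (gated on Hugging Face; license acceptance required).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
text_backbone = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)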
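
The Experiment Setup row fully specifies the optimization schedule. The sketch below restates it in PyTorch-style code; it is not the authors' implementation, and the total step count is only an estimate derived from the 9.2M-pair subset, the batch size of 6,144, and 12 epochs.

import math
import torch

LR = 5e-4                        # peak learning rate, cosine-decayed
TOTAL_STEPS = 18_000             # estimate: 9.2M pairs / 6,144 per batch * 12 epochs
WARMUP_STEPS = 2_000             # quadratic warmup of lambda
LAMBDA_I, LAMBDA_T = 5e-4, 1e-3  # regularization weights for the image / text branches

# Learnable temperature, initialized so that exp(log_scale) = 1 / 0.07.
log_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def lr_at(step: int) -> float:
    # Cosine decay of the learning rate over training (no LR warmup is mentioned in the paper).
    return LR * 0.5 * (1 + math.cos(math.pi * step / TOTAL_STEPS))

def lambda_at(step: int, final_value: float) -> float:
    # Quadratic warmup of lambda over the first 2k steps, then frozen at its final value.
    if step < WARMUP_STEPS:
        return final_value * (step / WARMUP_STEPS) ** 2
    return final_value

def logit_scale() -> torch.Tensor:
    # Exponentiated temperature, clipped at 100 as in CLIP [34].
    return log_scale.exp().clamp(max=100.0)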
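
For context, zero-shot cross-modal retrieval on the Flickr30k and MSCOCO splits is conventionally scored with Recall@K over a query-gallery similarity matrix. The sketch below is a generic version of that metric, not the authors' evaluation code; queries with multiple correct matches (e.g., the five captions per image) are handled through ground-truth index sets.

import numpy as np

def recall_at_k(sim: np.ndarray, gt: list[set[int]], k: int) -> float:
    # sim[i, j]: similarity between query i and gallery item j.
    # gt[i]: set of gallery indices that are correct matches for query i.
    topk = np.argsort(-sim, axis=1)[:, :k]  # k most similar gallery items per query
    hits = [bool(gt[i] & set(topk[i].tolist())) for i in range(sim.shape[0])]
    return float(np.mean(hits))

# Example: 2 queries over a gallery of 4 items, Recall@1.
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.3, 0.2, 0.8, 0.1]])
print(recall_at_k(sim, [{0}, {2, 3}], k=1))  # -> 1.0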