Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Unified Lexical Representation for Interpretable Visual-Language Alignment
Authors: Yifan Li, Yikai Wang, Yanwei Fu, Dongyu Ru, Zheng Zhang, Tong He
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On cross-modal retrieval benchmarks, LexVLA, trained on the CC-12M multi-modal dataset, outperforms baselines fine-tuned on larger datasets (e.g., YFCC15M) and those trained from scratch on even bigger datasets (e.g., 1.1B data, including CC-12M). We conduct extensive experiments to analyze LexVLA. |
| Researcher Affiliation | Collaboration | 1 Fudan University, 2 Amazon Web Services; EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/Clementine24/LexVLA. |
| Open Datasets | Yes | We use CC-12M [4] for training, a dataset consisting of 12.4 million image-text pairs. We successfully download 9.2M pairs and use this subset as our training set. For evaluation, we use Flickr30k [33] and MSCOCO [23] to evaluate zero-shot cross-modal retrieval tasks. |
| Dataset Splits | Yes | We conduct experiments on zero-shot cross-modal retrieval tasks on Flickr30k and MSCOCO based on the splits in [16] following previous approaches. |
| Hardware Specification | Yes | We use 8 A100 GPUs of 40GB memory to train Lex VLA. |
| Software Dependencies | No | The paper mentions `DINOv2 [30]` and `Llama 2 [40] 7B model` as backbones, and `Adam optimizer [17]`. While the backbones are specific models, explicit version numbers for general software libraries or frameworks like PyTorch, TensorFlow, or specific Python versions are not provided. Adam optimizer parameters are given, but not its software version. |
| Experiment Setup | Yes | We use the Adam optimizer [17] with learning rate 5e-4 and cosine decay, batch size of 6,144, BFloat16 precision, for 12 epochs. We initialize τ as 0.07 and clip the logits larger than 100 as in [34]. We quadratically warm up λ in the first 2k steps and then freeze it, as in [31]. We set λ_I as 5e-4 and λ_T as 1e-3. |
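The setup row above packs several details into one sentence. The sketch below, a minimal PyTorch illustration rather than the authors' code, shows two of them concretely: the CLIP-style learnable temperature (τ initialized to 0.07, logit scale clipped at 100, as in [34]) and the quadratic warmup of the sparsity weight λ over the first 2k steps. Function names and the dummy embeddings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Learnable log of the inverse temperature, initialized so that tau = 0.07.
log_inv_tau = torch.nn.Parameter(torch.log(torch.tensor(1.0 / 0.07)))

def similarity_logits(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Scaled cosine-similarity logits; the logit scale (1/tau) is clipped at 100, as in CLIP."""
    scale = log_inv_tau.exp().clamp(max=100.0)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    return scale * img_emb @ txt_emb.t()

def sparsity_weight(step: int, warmup_steps: int = 2_000, lam_max: float = 1.0) -> float:
    """Quadratic warmup of lambda over the first 2k steps, then held constant."""
    return lam_max * min(step / warmup_steps, 1.0) ** 2

# Tiny usage example with random embeddings (batch of 4, dimension 8).
img = torch.randn(4, 8)
txt = torch.randn(4, 8)
logits = similarity_logits(img, txt)
labels = torch.arange(4)
contrastive_loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
print(contrastive_loss.item(), sparsity_weight(1_000))
```

In full training, λ_I = 5e-4 and λ_T = 1e-3 would weight the image- and text-side sparsity terms multiplied by `sparsity_weight(step)`; the optimizer (Adam, lr 5e-4 with cosine decay), BFloat16 autocast, and batch size 6,144 are as quoted in the table.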