QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization

Authors: Qi Song, Tianxiang Gong, Shiqi Gao, Haoyi Zhou, Jianxin Li

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on multiple datasets show that our method achieves superior performance in multimodal contrastive learning benchmarks.
Researcher Affiliation | Academia | Qi Song¹, Tianxiang Gong², Shiqi Gao², Haoyi Zhou¹,³, Jianxin Li²,³; ¹School of Software, Beihang University; ²School of Computer Science and Engineering, Beihang University; ³Zhongguancun Laboratory, Beijing; {songqi23, gongtx, gaoshiqi, haoyi, lijx}@buaa.edu
Pseudocode | Yes | Algorithm 1: L_UIC loss calculation; Algorithm 2: Calculate similarity map; Algorithm 3: L_cos loss calculation
Open Source Code | Yes | We provide source code of our paper. https://github.com/Vortexsong/QUEST
Open Datasets | Yes | Flickr30k is a benchmark commonly used in computer vision (CV) and natural language processing (NLP)... Microsoft Common Objects in Context (MS-COCO) is a large-scale dataset... Free Music Archive (FMA) is an extensive, open-access dataset... GTZAN is a benchmark dataset widely used in Music Information Retrieval (MIR)... Clotho: an audio captioning dataset... AudioCaps is a seminal dataset for audio captioning...
Dataset Splits | Yes | FMA's comprehensive nature makes it ideal for various MIR tasks such as genre classification, artist identification, and music recommendation, while its predefined train/validation/test splits and subsets of varying sizes facilitate reproducible research and benchmarking in the field.
Hardware Specification | Yes | All experiments in this paper are run on a single NVIDIA A100 GPU.
Software Dependencies | Yes | The implementation is based on PyTorch 2.0.1.
Experiment Setup | Yes | Table 3: Multimodal Model Training Details. ... VSE++ 30 128 adam 2e-4 0 step LR ... CLIP 5 256 adamw 2e-5 100 cosine_annealing. ... We choose the hyperparameters alpha_t as 0.08 on most experiments and set positive_sample to false.
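
The two training rows quoted under Experiment Setup can be turned into structured records for anyone scripting a reproduction. The sketch below is a minimal illustration, not the authors' code: the excerpt omits the column headers of Table 3, so the field names (`epochs`, `batch_size`, `optimizer`, `learning_rate`, `warmup`, `scheduler`) are assumptions about what each value means, and `TrainConfig`, `parse_row`, and `QUEST_HPARAMS` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Field names are ASSUMED; the quoted rows list values only, not headers.
    model: str
    epochs: int
    batch_size: int
    optimizer: str
    learning_rate: float
    warmup: int
    scheduler: str


def parse_row(row: str) -> TrainConfig:
    """Parse one whitespace-separated row; the scheduler name may contain spaces."""
    model, epochs, batch, opt, lr, warmup, *sched = row.split()
    return TrainConfig(
        model=model,
        epochs=int(epochs),
        batch_size=int(batch),
        optimizer=opt,
        learning_rate=float(lr),
        warmup=int(warmup),
        scheduler=" ".join(sched),
    )


# Rows copied from the Experiment Setup excerpt (trailing period dropped).
ROWS = [
    "VSE++ 30 128 adam 2e-4 0 step LR",
    "CLIP 5 256 adamw 2e-5 100 cosine_annealing",
]
CONFIGS = {cfg.model: cfg for cfg in map(parse_row, ROWS)}

# Method-specific settings quoted in the excerpt ("on most experiments").
QUEST_HPARAMS = {"alpha_t": 0.08, "positive_sample": False}
```

Looking up `CONFIGS["CLIP"]` then gives a single immutable record per model. Rows elided by "..." in the excerpt are deliberately left out rather than guessed.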