QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization
Authors: Qi Song, Tianxiang Gong, Shiqi Gao, Haoyi Zhou, Jianxin Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple datasets show that our method achieves superior performance in multimodal contrastive learning benchmarks. |
| Researcher Affiliation | Academia | Qi Song¹, Tianxiang Gong², Shiqi Gao², Haoyi Zhou¹,³, Jianxin Li²,³; ¹School of Software, Beihang University; ²School of Computer Science and Engineering, Beihang University; ³Zhongguancun Laboratory, Beijing; {songqi23, gongtx, gaoshiqi, haoyi, lijx}@buaa.edu |
| Pseudocode | Yes | Algorithm 1: L_UIC loss calculation; Algorithm 2: Calculate similarity map; Algorithm 3: L_cos loss calculation |
| Open Source Code | Yes | We provide source code of our paper. https://github.com/Vortexsong/QUEST |
| Open Datasets | Yes | Flickr30k is a benchmark commonly used in computer vision (CV) and natural language processing (NLP)... Microsoft Common Objects in Context (MS-COCO) is a large-scale dataset... Free Music Archive (FMA) is an extensive, open-access dataset... GTZAN is a benchmark dataset widely used in Music Information Retrieval (MIR)... Clotho: an audio captioning dataset... AudioCaps is a seminal dataset for audio captioning... |
| Dataset Splits | Yes | FMA's comprehensive nature makes it ideal for various MIR tasks such as genre classification, artist identification, and music recommendation, while its predefined train/validation/test splits and subsets of varying sizes facilitate reproducible research and benchmarking in the field. |
| Hardware Specification | Yes | All experiments in this paper are run on a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | The implementation is based on PyTorch 2.0.1. |
| Experiment Setup | Yes | Table 3: Multimodal Model Training Details. ... VSE++ 30 128 adam 2e-4 0 step LR ... CLIP 5 256 adamw 2e-5 100 cosine_annealing. ... We choose the hyperparameters alpha_t as 0.08 on most experiments and set positive_sample to false. |