Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
QUEST: Quadruple Multimodal Contrastive Learning with Constraints and Self-Penalization
Authors: Qi Song, Tianxiang Gong, Shiqi Gao, Haoyi Zhou, Jianxin Li
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on multiple datasets show that our method achieves superior performance in multimodal contrastive learning benchmarks. |
| Researcher Affiliation | Academia | Qi Song¹, Tianxiang Gong², Shiqi Gao², Haoyi Zhou¹,³, Jianxin Li²,³; ¹School of Software, Beihang University; ²School of Computer Science and Engineering, Beihang University; ³Zhongguancun Laboratory, Beijing; EMAIL |
| Pseudocode | Yes | Algorithm 1 LUIC loss calculation; Algorithm 2 Calculate similarity map; Algorithm 3 Lcos loss calculation |
| Open Source Code | Yes | We provide source code of our paper: https://github.com/Vortexsong/QUEST |
| Open Datasets | Yes | Flickr30k is a benchmark commonly used in computer vision (CV) and natural language processing (NLP)... Microsoft Common Objects in Context (MS-COCO) is a large-scale dataset... Free Music Archive (FMA) is an extensive, open-access dataset... GTZAN is a benchmark dataset widely used in Music Information Retrieval (MIR)... Clotho: an audio captioning dataset... AudioCaps is a seminal dataset for audio captioning... |
| Dataset Splits | Yes | FMA's comprehensive nature makes it ideal for various MIR tasks such as genre classification, artist identification, and music recommendation, while its predefined train/validation/test splits and subsets of varying sizes facilitate reproducible research and benchmarking in the field. |
| Hardware Specification | Yes | All experiments in this paper are run on a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | The implementation is based on PyTorch 2.0.1. |
| Experiment Setup | Yes | Table 3: Multimodal Model Training Details. ... VSE++ 30 128 adam 2e-4 0 step LR ... CLIP 5 256 adamw 2e-5 100 cosine_annealing. ... We choose the hyperparameters alpha_t as 0.08 on most experiments and set positive_sample to false. |