Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Reddy Evuru, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA; NVIDIA, Bangalore, India |
| Pseudocode | Yes | Algorithm 1 demonstrates the algorithm for our template-based synthetic audio-caption creation process. |
| Open Source Code | Yes | We open-source our code and data: https://sreyan88.github.io/compa_iclr/ |
| Open Datasets | Yes | More than 90% of audio snippets in CompA are sourced from real-world audio samples from AudioSet (Gemmeke et al., 2017) by expert annotators experienced in audio and language research. For pre-training, we make minor modifications to the LAION-audio-630K pre-training dataset proposed by Wu et al. (2023). We introduce CompA-661k, with 661k unique audio-caption pairs. For training, we use only the compositional audios from Clotho and AudioCaps in addition to our AudioSet-CompA dataset. |
| Dataset Splits | No | The paper mentions 'Clotho Validation' and 'AudioCaps Validation' in Table 1 but does not provide specific numerical split percentages or counts for validation data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using HTSAT-large for the audio encoder and Flan-T5-large for the text encoder, and tools like GPT-4, LLaMA-2, and spaCy. However, it does not provide version numbers for the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch), or other libraries/packages used, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | For vanilla contrastive pre-training with CompA-661k, we use a batch size of 24, and Adam optimizer with a learning rate of 1e-4, and warm-up of 3200 steps, and train for 45 epochs. For training with compositionally aware hard negatives, we start with vanilla CLAP weights and train for 20 epochs with no warm-up. We follow a similar setup for modular contrastive training. |
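
The hyperparameters quoted in the Experiment Setup row can be written down as a small configuration, which makes the gaps in the reported setup easier to see. The following is a minimal sketch, not the authors' code: the `TrainConfig` class, its field names, and `lr_at_step` are hypothetical, the warm-up is assumed to be linear (the quoted text does not give its shape), and reusing the pre-training batch size and learning rate for the hard-negative stage is an assumption, since the quoted text does not restate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters for one training stage (field names are illustrative)."""
    batch_size: int
    learning_rate: float
    warmup_steps: int
    epochs: int


# Values quoted for vanilla contrastive pre-training with CompA-661k.
PRETRAIN = TrainConfig(batch_size=24, learning_rate=1e-4, warmup_steps=3200, epochs=45)

# Hard-negative stage: 20 epochs, no warm-up (both quoted).
# ASSUMPTION: batch size and learning rate are carried over from pre-training.
HARD_NEGATIVE = TrainConfig(batch_size=24, learning_rate=1e-4, warmup_steps=0, epochs=20)


def lr_at_step(cfg: TrainConfig, step: int) -> float:
    """Learning rate at a 0-indexed optimizer step.

    ASSUMPTION: linear warm-up to the peak rate, then a constant rate.
    """
    if cfg.warmup_steps > 0 and step < cfg.warmup_steps:
        return cfg.learning_rate * (step + 1) / cfg.warmup_steps
    return cfg.learning_rate
```

Under these assumptions, `lr_at_step(PRETRAIN, 3199)` reaches the peak rate of 1e-4, while `lr_at_step(HARD_NEGATIVE, 0)` starts at the peak rate immediately because that stage has no warm-up.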