CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Reddy Evuru, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. (A minimal sketch of this pairwise evaluation protocol appears after the table.)
Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA; NVIDIA, Bangalore, India
Pseudocode | Yes | Algorithm 1 details our template-based synthetic audio-caption creation process. (An illustrative sketch of such a template-based process appears after the table.)
Open Source Code | Yes | We open-source our code and data: https://sreyan88.github.io/compa_iclr/.
Open Datasets | Yes | More than 90% of audio snippets in CompA are sourced from real-world audio samples from AudioSet (Gemmeke et al., 2017) by expert annotators experienced in audio and language research. For pre-training, we make minor modifications to the LAION-Audio-630K pre-training dataset proposed by Wu et al. (2023). We introduce CompA-661k, with 661k unique audio-caption pairs. For training, we use only the compositional audios from Clotho and AudioCaps in addition to our AudioSet-CompA dataset.
Dataset Splits | No | The paper mentions 'Clotho Validation' and 'AudioCaps Validation' in Table 1 but does not provide specific numerical split percentages or counts for validation data.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using HTSAT-large for the audio encoder and Flan-T5-large for the text encoder, and tools such as GPT-4, LLaMA-2, and spaCy. However, it does not provide version numbers for the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch), or other libraries/packages, which would be necessary for full reproducibility.
Experiment Setup | Yes | For vanilla contrastive pre-training with CompA-661k, we use a batch size of 24, the Adam optimizer with a learning rate of 1e-4, and a warm-up of 3200 steps, and train for 45 epochs. For training with compositionally-aware hard negatives, we start from vanilla CLAP weights and train for 20 epochs with no warm-up. We follow a similar setup for modular contrastive training. (A sketch of this optimizer and warm-up setup appears below.)
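For reference on the evaluation described in the Research Type row: both CompA-order and CompA-attribute reduce to a pairwise retrieval test in which the model must assign a higher audio-text similarity to the correct caption than to a compositionally perturbed distractor (events reordered, or attributes swapped between events). The sketch below is a minimal illustration of that protocol, not the authors' released evaluation code; the `encode_audio` / `encode_text` helpers are hypothetical and assumed to return L2-normalized embeddings from a CLAP-style model.

```python
import numpy as np

def compa_pairwise_accuracy(model, examples):
    """Score a CLAP-style model on CompA-style caption pairs.

    `examples` is an iterable of (audio_path, correct_caption, distractor_caption)
    triples; `model` is assumed (hypothetically) to expose encode_audio /
    encode_text helpers that return L2-normalized 1-D embedding vectors.
    """
    examples = list(examples)
    correct = 0
    for audio_path, pos_caption, neg_caption in examples:
        a = model.encode_audio(audio_path)       # audio embedding
        t_pos = model.encode_text(pos_caption)   # correct composition
        t_neg = model.encode_text(neg_caption)   # perturbed composition
        # Cosine similarity is a dot product on normalized vectors.
        if np.dot(a, t_pos) > np.dot(a, t_neg):
            correct += 1
    return correct / len(examples)

# Random chance on this two-way test is 0.5, the baseline that the paper
# reports current ALMs only marginally exceed.
```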
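The Pseudocode row refers to the paper's Algorithm 1 for template-based synthetic audio-caption creation. Below is a rough, self-contained sketch of that idea, assuming two single-event clips are concatenated and a fixed order template is filled with their textual labels; the actual template set and mixing strategy are those defined in Algorithm 1 and the released code.

```python
import random
import numpy as np

# Illustrative order templates only; the paper's actual templates are
# defined in its Algorithm 1 and released code.
ORDER_TEMPLATES = [
    "{a} followed by {b}",
    "{b} after {a}",
]

def make_synthetic_pair(clip_a, label_a, clip_b, label_b,
                        sample_rate=16000, gap_s=0.25):
    """Concatenate two single-event clips and caption their temporal order.

    clip_a / clip_b are 1-D numpy waveforms sampled at `sample_rate`;
    label_a / label_b are textual event descriptions (e.g., "a dog barking").
    """
    gap = np.zeros(int(gap_s * sample_rate), dtype=clip_a.dtype)
    audio = np.concatenate([clip_a, gap, clip_b])
    caption = random.choice(ORDER_TEMPLATES).format(a=label_a, b=label_b)
    return audio, caption
```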
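The Experiment Setup row gives enough detail to reconstruct the optimizer schedule for vanilla pre-training on CompA-661k (batch size 24, Adam, learning rate 1e-4, 3200 warm-up steps, 45 epochs). The PyTorch sketch below pins those reported values; the warm-up shape (linear, then constant) and anything not quoted above, such as weight decay, are assumptions.

```python
import torch

# Reported values: batch size 24, Adam, lr 1e-4, 3200 warm-up steps,
# 45 epochs for vanilla contrastive pre-training on CompA-661k.
BATCH_SIZE = 24
LR = 1e-4
WARMUP_STEPS = 3200
EPOCHS = 45

def build_optimizer_and_scheduler(model: torch.nn.Module):
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)

    def warmup(step: int) -> float:
        # Linear warm-up to the base LR, then hold it constant
        # (the post-warm-up shape is an assumption, not stated in the paper excerpt).
        return min(1.0, (step + 1) / WARMUP_STEPS)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup)
    return optimizer, scheduler

# For the hard-negative stage, the paper starts from vanilla CLAP weights and
# trains for 20 epochs with no warm-up (so the warm-up term drops out).
```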