CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Authors: Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Reddy Evuru, Ramaneswaran S, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute-binding of acoustic events. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, thereby struggling with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities. (A hedged sketch of this ordering-based evaluation appears after the table.) |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA; NVIDIA, Bangalore, India |
| Pseudocode | Yes | Algorithm 1 presents the algorithm for our template-based synthetic audio-caption creation process. (An illustrative sketch of such a composition step follows the table.) |
| Open Source Code | Yes | We open-source our code and data: https://sreyan88.github.io/compa_iclr/. |
| Open Datasets | Yes | More than 90% of audio snippets in CompA are sourced from real-world audio samples from AudioSet (Gemmeke et al., 2017) by expert annotators experienced in audio and language research. For pre-training, we make minor modifications to the LAION-audio-630K pre-training dataset proposed by Wu et al. (2023). We introduce CompA-661k, with 661k unique audio-caption pairs. For training, we use only the compositional audios from Clotho and AudioCaps in addition to our AudioSet-CompA dataset. |
| Dataset Splits | No | The paper mentions 'Clotho Validation' and 'Audio Caps Validation' in Table 1 but does not provide specific numerical split percentages or counts for validation data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions using HTSAT-large for the audio encoder and Flan-T5-large for the text encoder, and tools like GPT-4, LLaMA-2, and spaCy. However, it does not provide version numbers for the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch), or other libraries/packages used, which would be necessary for full reproducibility. |
| Experiment Setup | Yes | For vanilla contrastive pre-training with CompA-661k, we use a batch size of 24 and the Adam optimizer with a learning rate of 1e-4 and 3200 warm-up steps, and train for 45 epochs. For training with compositionally aware hard negatives, we start from vanilla CLAP weights and train for 20 epochs with no warm-up. We follow a similar setup for modular contrastive training. (A hedged configuration sketch follows the table.) |
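For the ordering-based evaluation referenced in the Research Type row, the following is a minimal sketch of how a CompA-order-style accuracy could be computed with a CLAP-like dual encoder. The `encode_audio`/`encode_text` interface and the two-way caption choice are assumptions for illustration, not the benchmark's released evaluation code.

```python
import torch.nn.functional as F

def order_accuracy(model, items) -> float:
    """items: iterable of (audio, correct_caption, reordered_caption) triples."""
    correct, total = 0, 0
    for audio, caption, distractor in items:
        a = F.normalize(model.encode_audio(audio), dim=-1)                  # (1, d) audio embedding
        t = F.normalize(model.encode_text([caption, distractor]), dim=-1)   # (2, d) text embeddings
        sims = a @ t.T                                                      # cosine similarities, shape (1, 2)
        # The model is counted correct if the true composition scores higher
        # than the caption with the acoustic-event order swapped.
        correct += int(sims[0, 0] > sims[0, 1])
        total += 1
    return correct / total  # random chance is 0.5 for this two-way choice
```

An ALM that ignores event order scores near 0.5 here, which is the "marginally better than random chance" behavior the paper reports for current baselines.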
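For the Pseudocode row, this is a hedged sketch of what one step of template-based synthetic audio-caption creation might look like; the template strings and the `compose_pair` helper are illustrative assumptions, and the paper's Algorithm 1 remains the authoritative description.

```python
import random
import numpy as np

# Hypothetical ordering templates; the paper's Algorithm 1 defines the actual set.
TEMPLATES = [
    "{a} followed by {b}",
    "{b} after {a}",
]

def compose_pair(audio_a, caption_a, audio_b, caption_b, sample_rate=16000):
    """Concatenate two single-event clips and render a compositional caption."""
    template = random.choice(TEMPLATES)
    caption = template.format(a=caption_a.rstrip("."), b=caption_b.rstrip("."))
    audio = np.concatenate([audio_a, audio_b])  # event A plays first, then event B
    return audio, caption, sample_rate

# Example with silent one-second placeholder clips at 16 kHz:
audio, caption, sr = compose_pair(np.zeros(16000), "a dog barks",
                                  np.zeros(16000), "a car horn honks")
print(caption)  # e.g. "a dog barks followed by a car horn honks"
```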
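For the Experiment Setup row, the sketch below shows a vanilla CLAP-style contrastive pre-training loop using the reported hyperparameters (batch size 24, Adam, learning rate 1e-4, 3200 warm-up steps, 45 epochs). The model interface, data loader, and linear warm-up schedule are assumptions, not the authors' released training code.

```python
import torch
import torch.nn.functional as F

BATCH_SIZE, LR, WARMUP_STEPS, EPOCHS = 24, 1e-4, 3200, 45

def clap_contrastive_loss(audio_emb, text_emb, logit_scale):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * audio_emb @ text_emb.T                  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def pretrain(model, loader):
    optimizer = torch.optim.Adam(model.parameters(), lr=LR)
    # Linear warm-up over the first 3200 steps, then a constant learning rate (assumed schedule).
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / WARMUP_STEPS))
    for _ in range(EPOCHS):
        for audio, text in loader:                       # batches of 24 audio-caption pairs
            audio_emb = model.encode_audio(audio)         # e.g. an HTSAT-large audio branch
            text_emb = model.encode_text(text)            # e.g. a Flan-T5-large text branch
            loss = clap_contrastive_loss(audio_emb, text_emb, model.logit_scale.exp())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            scheduler.step()
```

The hard-negative and modular contrastive stages described in the paper would start from these vanilla CLAP weights and run for 20 epochs without warm-up, per the reported setup.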