Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Authors: Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini 2.0 Flash achieves only 59.93% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks. |
| Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA |
| Pseudocode | Yes | Algorithm 1: String Match Evaluation Algorithm |
| Open Source Code | Yes | The benchmark will be publicly released upon paper acceptance. The test-mini subset will be completely open-sourced on GitHub, together with ground-truth responses and all meta-data. |
| Open Datasets | Yes | We began by collecting diverse audio corpora, including speech, music, and environmental sounds, prioritizing real recordings over synthetic data. ... we gathered 13 audio corpora to ensure a strong foundation for task development (more details in Appendix F). |
| Dataset Splits | Yes | MMAU comprises 10,000 multiple-choice questions (MCQs) divided into a test-mini set and a main test set. The test-mini set, comprising 1,000 questions, reflects the task distribution of the main test set and is intended for hyperparameter tuning. ... For evaluation, 1,000 instances were chosen to form the test-mini set, evenly distributed across all tasks, while the remaining instances were allocated to the main test set. |
| Hardware Specification | No | The paper mentions various models used (e.g., Qwen2-Audio-Chat, GAMA, Gemini-Flash) but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions several models and tools like GPT-4o, EnCLAP, MuLLaMa, Whisper base, Parler-TTS, and a text-to-audio model. However, it does not provide specific version numbers for these software components or other libraries used in the experimental setup, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We use micro-averaged accuracy as our evaluation metric. Specifically, we present the question along with the list of choices to the models, instructing them to select the correct choice. ... To mitigate any potential bias in the model's option selection due to ordering, we randomize the order of the options five times and select the option chosen most frequently. Additionally, we experiment with various prompt sets across all LALMs and report the best results. |
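
The evaluation protocol quoted in the Experiment Setup row (shuffle the options five times, keep the most frequently chosen option, then score with micro-averaged accuracy) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `ask_model` callback, the fixed seed, and the tie-breaking rule (first option to reach the top count wins) are all assumptions made for illustration, since the quoted text does not specify them.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def majority_vote_answer(
    question: str,
    choices: Sequence[str],
    ask_model: Callable[[str, Sequence[str]], str],  # hypothetical model wrapper
    n_shuffles: int = 5,  # the quoted setup randomizes the order five times
    seed: int = 0,        # assumption: fixed seed so the sketch is repeatable
) -> str:
    """Query the model on several random option orders and return the most frequent pick."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_shuffles):
        shuffled = list(choices)
        rng.shuffle(shuffled)
        votes[ask_model(question, shuffled)] += 1
    # Assumption: ties go to the option that was counted first.
    return votes.most_common(1)[0][0]


def micro_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Micro-averaged accuracy: correct predictions over all questions, pooled across tasks."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def content_based_stub(question: str, options: Sequence[str]) -> str:
    """Toy stand-in for a model: picks 'dog' regardless of where it appears."""
    return "dog" if "dog" in options else options[0]
```

A model that answers based on content, like `content_based_stub`, returns the same option under every shuffle, so `majority_vote_answer("Which animal is barking?", ["cat", "dog", "bird"], content_based_stub)` yields `"dog"`. A model that always picks the first listed option would instead have its votes spread across options, which is the ordering bias the protocol is meant to dampen.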