Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

Authors: Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, Dinesh Manocha

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We assess 18 open-source and proprietary (Large) Audio-Language Models, demonstrating the significant challenges posed by MMAU. Notably, even the most advanced Gemini 2.0 Flash achieves only 59.93% accuracy, and the state-of-the-art open-source Qwen2-Audio achieves only 52.50%, highlighting considerable room for improvement. We believe MMAU will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
Researcher Affiliation | Collaboration | University of Maryland, College Park, USA; Adobe, USA
Pseudocode | Yes | Algorithm 1: String Match Evaluation Algorithm
Open Source Code | Yes | The benchmark will be publicly released upon paper acceptance. The test-mini subset will be completely open-sourced on GitHub, together with ground-truth responses and all metadata.
Open Datasets | Yes | We began by collecting diverse audio corpora, including speech, music, and environmental sounds, prioritizing real recordings over synthetic data. ... we gathered 13 audio corpora to ensure a strong foundation for task development (more details in Appendix F).
Dataset Splits | Yes | MMAU comprises 10,000 multiple-choice questions (MCQs) divided into a test-mini set and a main test set. The test-mini set, comprising 1,000 questions, reflects the task distribution of the main test set and is intended for hyperparameter tuning. ... For evaluation, 1,000 instances were chosen to form the test-mini set, evenly distributed across all tasks, while the remaining instances were allocated to the main test set.
Hardware Specification | No | The paper mentions various models used (e.g., Qwen2-Audio-Chat, GAMA, Gemini-Flash) but does not provide specific details about the hardware (e.g., GPU models, CPU types, or cloud compute instances) used to run the experiments.
Software Dependencies | No | The paper mentions several models and tools like GPT-4o, EnCLAP, MU-LLaMA, Whisper base, Parler-TTS, and a text-to-audio model. However, it does not provide specific version numbers for these software components or other libraries used in the experimental setup, which are necessary for full reproducibility.
Experiment Setup | Yes | We use micro-averaged accuracy as our evaluation metric. Specifically, we present the question along with the list of choices to the models, instructing them to select the correct choice. ... To mitigate any potential bias in the model's option selection due to ordering, we randomize the order of the options five times and select the option chosen most frequently. Additionally, we experiment with various prompt sets across all LALMs and report the best results.
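The Dataset Splits row describes sampling the 1,000-question test-mini set evenly across tasks, with the remainder forming the main test set. A minimal sketch of such a split, assuming each question record carries a "task" field (the field name and function name are illustrative, not from the paper):

```python
import random
from collections import defaultdict

def make_test_mini(questions, mini_size=1000, seed=0):
    """Sample a test-mini set evenly across tasks; remaining
    questions form the main test set. A hypothetical sketch of
    the split described in the paper, not its actual code."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for q in questions:
        by_task[q["task"]].append(q)
    per_task = mini_size // len(by_task)  # even allocation per task
    test_mini, main_test = [], []
    for items in by_task.values():
        rng.shuffle(items)
        test_mini.extend(items[:per_task])
        main_test.extend(items[per_task:])
    return test_mini, main_test
```

The even per-task allocation is what lets test-mini mirror the task distribution of the main test set, as the quoted passage states.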
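The Experiment Setup row (together with the string-match evaluation referenced under Pseudocode) can be sketched as follows. This is a simplified illustration under stated assumptions: the substring-based answer extraction is a guess at Algorithm 1's matching rule, and `model_fn` is a hypothetical callable standing in for an audio-language model:

```python
import random
from collections import Counter

def string_match(response, choices):
    """Map a free-text response to one answer choice by
    case-insensitive substring matching; return None when the
    match is ambiguous (simplified stand-in for Algorithm 1)."""
    response = response.lower()
    hits = [c for c in choices if c.lower() in response]
    return hits[0] if len(hits) == 1 else None

def evaluate(model_fn, questions, n_orderings=5, seed=0):
    """Micro-averaged accuracy with the order-randomization
    majority vote described in the setup: shuffle the options
    five times and keep the most frequently chosen answer."""
    rng = random.Random(seed)
    correct = 0
    for q in questions:  # q: {"question": str, "choices": [...], "answer": str}
        votes = Counter()
        for _ in range(n_orderings):
            shuffled = q["choices"][:]
            rng.shuffle(shuffled)
            picked = string_match(model_fn(q["question"], shuffled), shuffled)
            if picked is not None:
                votes[picked] += 1
        if votes and votes.most_common(1)[0][0] == q["answer"]:
            correct += 1
    return correct / len(questions)
```

Micro-averaging here means every question counts equally in the final accuracy, regardless of which task it belongs to.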