Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Mellow: a small audio language model for reasoning
Authors: Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance. |
| Researcher Affiliation | Academia | Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj; Carnegie Mellon University; EMAIL |
| Pseudocode | No | The paper describes methods and training, but does not present a formal pseudocode block or algorithm. |
| Open Source Code | No | Yes, we released the ReasonAQA dataset and will subsequently release the training and evaluation code in June to reproduce all experiments. |
| Open Datasets | Yes | We provide the ReasonAQA dataset, including the synthetic question-answer pairs derived from AudioCaps and Clotho, along with details on how the dataset was generated using large language models. We use two audio datasets with human-labeled descriptions: AudioCaps [43] and Clotho [23]. |
| Dataset Splits | Yes | The dataset composition for each split is shown in Table 1. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for experiments, such as GPU or CPU models. While it mentions training models, it lacks explicit hardware specifications. |
| Software Dependencies | No | The paper mentions the use of Adam Optimiser [44] and specific language models like Llama 3 8B [24] and SmolLM2 [2], as well as the HTSAT [9] audio encoder, but it does not list specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Mellow is trained on the next-token prediction task, where the next-token is predicted based on past-tokens and the two input audios (A1, A2)... We use Adam Optimiser [44] with cosine learning rate schedule with a maximum learning rate of 1e-3. We train Mellow and all the ablation models for 30 epochs. |
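
The Experiment Setup row reports a cosine learning rate schedule with a maximum learning rate of 1e-3 over 30 epochs. As a rough illustration only, the sketch below computes such a schedule in plain Python. The function name `cosine_lr`, the per-epoch stepping, the absence of warmup, and the minimum learning rate of 0 are all assumptions made for this example; the quoted text does not specify them, and this is not the authors' training code.

```python
import math


def cosine_lr(step: int, total_steps: int, max_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate that decays from max_lr to min_lr.

    Hypothetical helper: min_lr and the lack of warmup are assumptions,
    not values reported in the quoted experiment setup.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


EPOCHS = 30  # number of training epochs reported in the table
# One learning-rate value per epoch boundary (assumed per-epoch stepping).
schedule = [cosine_lr(epoch, EPOCHS) for epoch in range(EPOCHS + 1)]
```

Under these assumptions the rate starts at the stated maximum, reaches half of it at the midpoint of training, and decays to the assumed minimum by the final epoch. A real training run would more likely update the rate per optimizer step than per epoch.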