Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Mellow: a small audio language model for reasoning

Authors: Soham Deshmukh, Satvik Dixit, Rita Singh, Bhiksha Raj

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To evaluate Mellow's reasoning ability, we benchmark it on a diverse set of tasks, assessing on both in-distribution and out-of-distribution data, including audio understanding, deductive reasoning, and comparative reasoning. Finally, we conduct extensive ablation studies to explore the impact of projection layer choices, synthetic data generation methods, and language model pretraining on reasoning performance.
Researcher Affiliation	Academia	Soham Deshmukh Satvik Dixit Rita Singh Bhiksha Raj Carnegie Mellon University EMAIL
Pseudocode	No	The paper describes methods and training, but does not present a formal pseudocode block or algorithm.
Open Source Code	No	Yes, we released the ReasonAQA dataset and will subsequently release the training and evaluation code in June to reproduce all experiments.
Open Datasets	Yes	We provide the ReasonAQA dataset, including the synthetic question-answer pairs derived from AudioCaps and Clotho, along with details on how the dataset was generated using large language models. We use two audio datasets with human-labeled descriptions: AudioCaps [43] and Clotho [23].
Dataset Splits	Yes	The dataset composition for each split is shown in Table 1.
Hardware Specification	No	The paper does not provide specific details on the hardware used for experiments, such as GPU or CPU models. While it mentions training models, it lacks explicit hardware specifications.
Software Dependencies	No	The paper mentions the use of Adam Optimiser [44] and specific language models like Llama 3 8B [24] and SmolLM2 [2], as well as the HTSAT [9] audio encoder, but it does not list specific version numbers for software dependencies such as programming languages or libraries (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	Mellow is trained on the next-token prediction task, where the next-token is predicted based on past-tokens and the two input audios (A1, A2)... We use Adam Optimiser [44] with cosine learning rate schedule with a maximum learning rate of 1e-3. We train Mellow and all the ablation models for 30 epochs.
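
The learning rate schedule quoted in the Experiment Setup row can be illustrated with a minimal sketch. Only the maximum learning rate (1e-3), the cosine shape, and the 30 epochs come from the quoted text. The final learning rate (`MIN_LR`), the number of optimizer steps per epoch (`STEPS_PER_EPOCH`), the absence of warmup, and the function name `cosine_lr` are assumptions made for illustration and may differ from what the authors used.

```python
import math

MAX_LR = 1e-3          # stated in the quoted experiment setup
NUM_EPOCHS = 30        # stated in the quoted experiment setup
MIN_LR = 0.0           # assumption: the final learning rate is not reported
STEPS_PER_EPOCH = 100  # hypothetical: depends on dataset size and batch size


def cosine_lr(step, total_steps, max_lr=MAX_LR, min_lr=MIN_LR):
    """Return the cosine-annealed learning rate at a 0-indexed step.

    Assumes no warmup phase: the rate starts at max_lr and decays to min_lr.
    """
    if total_steps <= 1:
        return max_lr
    # Fraction of training completed, clamped to [0, 1].
    progress = min(step, total_steps - 1) / (total_steps - 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = NUM_EPOCHS * STEPS_PER_EPOCH
schedule = [cosine_lr(s, total_steps) for s in range(total_steps)]

# Print the rate at the start of a few epochs to show the decay.
for epoch in (0, 10, 20, 29):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch * STEPS_PER_EPOCH]:.6f}")
```

In a PyTorch training loop, the same shape is usually obtained by pairing an Adam optimizer with a built-in cosine scheduler rather than computing the rate by hand. The paper excerpt does not say which framework or scheduler implementation was used.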