Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

AudSemThinker: Enhancing Audio-Language Models Through Reasoning over Semantics of Sound

Authors: Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments demonstrate that AUDSEMTHINKER outperforms state-of-the-art models across multiple training settings, highlighting its strength in semantic audio reasoning. Both AUDSEMTHINKER and the AUDSEM dataset are released publicly¹. ... 5 Experiments: This section details the evaluation of AUDSEMTHINKER's performance and the AUDSEM dataset's effectiveness. The experimental setup is described first, followed by the main results on established benchmarks and concluding with an ablation study.
Researcher Affiliation	Academia	Gijs Wijngaard, Elia Formisano, Michele Esposito, Michel Dumontier; Maastricht University
Pseudocode	No	The paper describes methodologies and pipelines in narrative text and visual diagrams (e.g., Figure 2, Section 3, Appendix B) but does not include any formally structured pseudocode or algorithm blocks.
Open Source Code	Yes	Both AUDSEMTHINKER and the AUDSEM dataset are released publicly¹. ¹https://github.com/GLJS/AudSemThinker
Open Datasets	Yes	Both AUDSEMTHINKER and the AUDSEM dataset are released publicly¹. ... The corresponding JSON metadata is uploaded to the Hugging Face Dataset Repository³. ³https://huggingface.co/datasets/gijs/audsem
Dataset Splits	Yes	Scores are reported on both Test-Mini (1k clips) and Test (10k clips) sets; some authors only report Test-Mini. MMAU is a multiple choice question benchmark, where the model has to select the answer out of four possible answers. MMAU uses the accuracy metric, as scores are being calculated by the match between the predicted answer and the correct answer.
Hardware Specification	Yes	Training uses a batch size of four on a single H100 GPU, taking approximately 12 hours for the full dataset (AUDSEMTHINKER) and six hours for the QA subset (AUDSEMTHINKER-QA). ... The model is trained on four H100 GPUs for approximately 10 hours, utilizing DeepSpeed ZeRO-3 [58] and vLLM [39] for efficient training and inference.
Software Dependencies	Yes	The model is compiled using PyTorch 2.0 for optimized performance and processes videos in batches of 32 using multiple worker processes for efficient data loading. ... The final dataset generation pipeline uses Qwen2.5-72B-Instruct, implementing structured generation with a schema through the xgrammar library [15] with the help of vLLM [39].
Experiment Setup	Yes	Both models are trained using the AdamW optimizer [46] with a learning rate of 2e-04 for one epoch, using a scheduler with linear decay and bf16 precision. Training uses a batch size of four on a single H100 GPU... For GRPO training, we use the default GRPO loss type with β = 0.01, generate six completions per prompt (k = 6), and employ a batch size of two per device with bf16 precision.
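
The Dataset Splits row quotes MMAU as a four-option multiple-choice benchmark scored by accuracy, that is, by whether the predicted answer matches the correct one. A minimal Python sketch of that metric follows. The function name `multiple_choice_accuracy` and the whitespace and case normalization are assumptions made for illustration; they are not taken from the paper or from the MMAU evaluation code.

```python
from typing import Sequence


def multiple_choice_accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Return the fraction of questions whose predicted option matches the correct one."""
    if len(predicted) != len(correct):
        raise ValueError("predicted and correct must have the same length")
    if not predicted:
        raise ValueError("at least one question is required")
    # Assumption: answers are compared after stripping whitespace and lowercasing.
    # The official benchmark scoring may normalize answers differently.
    matches = sum(
        p.strip().lower() == c.strip().lower() for p, c in zip(predicted, correct)
    )
    return matches / len(predicted)
```

For example, calling the function with predictions `["A", "b ", "C", "D"]` and reference answers `["A", "B", "D", "D"]` counts three matches out of four questions.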