Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Authors: Tien Manh Luong, Khai Nguyen, Dinh Phung, Reza Haffari, Lizhen Qu

NeurIPS 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and text-to-audio retrieval accuracy. Furthermore, we demonstrate the generalizability of our USW-RBF kernel by applying it to audio reasoning tasks, where it enhances the reasoning capabilities of large audio language models on the CompA-R in terms of correctness and quality. |
| Researcher Affiliation | Academia | ¹Monash University, Australia; ²University of Texas at Austin, USA; EMAIL {khainb}@utexas.edu |
| Pseudocode | No | The paper describes methods and equations but does not present them in a structured pseudocode or algorithm block. For example, the training objective is given as an equation (13) and the inference stage as equation (14), but without an explicit algorithm format. |
| Open Source Code | Yes | The code of our ACUS framework is released in https://github.com/v-manhlt3/ACUS |
| Open Datasets | Yes | Extensive experiments on AudioCaps and Clotho datasets demonstrate that our method significantly improves caption quality, lexical diversity, and text-to-audio retrieval accuracy. |
| Dataset Splits | No | The paper mentions using "AudioCaps test set" and "Clotho datasets" but does not explicitly state the train/validation/test splits, their percentages, or sample counts. It refers to "original settings" for training, which implies standard splits from other papers, but does not detail them in this text. |
| Hardware Specification | Yes | Table 12: The real-time-factor (RTF) on a single A6000 GPU at the inference step among MLE, MLE with contrastive loss, and MLE with ACUS framework. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer" and models like "GPT2", "BART", "CLAP", but does not provide specific version numbers for any programming languages or libraries (e.g., Python, PyTorch, CUDA, TensorFlow). |
| Experiment Setup | Yes | The Adam optimizer with β1 = 0.9, β2 = 0.999, and a weight decay coefficient of 0.01 is used to train the model for both datasets. For AudioCaps, we use a batch size of 64 and warm up for 2000 steps before reaching the peak learning rate at lr = 2e-5. For Clotho, we use a batch size of 48 with the gradient accumulation step of 2 and warm up for 1000 steps before reaching the peak learning rate at lr = 2e-5. We perform a grid search for the hyperparameter γ = {0.5, 1.5, 2.5, 3.5} for the temporal-similarity metric. We choose the best value of γ, which is 2.5 and 1.5 for the AudioCaps and Clotho datasets, respectively. We also perform a grid search for the stochastic decoding methods at the inference state to choose the best decoding hyperparameters for each stochastic decoding method, p = {0.5, 0.6, 0.7, 0.8, 0.9} for top-p sampling, k = {3, 4, 5} for top-k sampling, and temp = {1.1, 1.2, 1.3, 1.4, 1.5} for temperature sampling. |
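
The Experiment Setup values above can be collected into a small configuration sketch, which may help anyone trying to reproduce the runs. This is not the authors' code. The names `TrainConfig`, `CONFIGS`, `warmup_lr`, and `decoding_candidates` are hypothetical, the linear shape of the warmup and the constant rate after it are assumptions (the source only gives the warmup length and the peak learning rate), and a gradient accumulation step of 1 for AudioCaps is assumed because none is reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Per-dataset training settings taken from the Experiment Setup row."""
    batch_size: int
    grad_accum_steps: int
    warmup_steps: int
    peak_lr: float = 2e-5              # reported peak learning rate
    adam_betas: tuple = (0.9, 0.999)   # reported Adam beta1, beta2
    weight_decay: float = 0.01         # reported weight decay coefficient

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to one optimizer update.
        return self.batch_size * self.grad_accum_steps


CONFIGS = {
    # ASSUMPTION: no accumulation is reported for AudioCaps, so 1 is used.
    "AudioCaps": TrainConfig(batch_size=64, grad_accum_steps=1, warmup_steps=2000),
    "Clotho": TrainConfig(batch_size=48, grad_accum_steps=2, warmup_steps=1000),
}

# Grid for the temporal-similarity hyperparameter gamma, and the reported best values.
GAMMA_GRID = (0.5, 1.5, 2.5, 3.5)
BEST_GAMMA = {"AudioCaps": 2.5, "Clotho": 1.5}

# Reported search grids for each stochastic decoding method.
DECODING_GRID = {
    "top_p": (0.5, 0.6, 0.7, 0.8, 0.9),
    "top_k": (3, 4, 5),
    "temperature": (1.1, 1.2, 1.3, 1.4, 1.5),
}


def warmup_lr(step: int, cfg: TrainConfig) -> float:
    """Learning rate at a given optimizer step.

    ASSUMPTION: linear ramp from 0 to cfg.peak_lr, then held constant.
    The source does not state the warmup shape or any later decay.
    """
    if step >= cfg.warmup_steps:
        return cfg.peak_lr
    return cfg.peak_lr * step / cfg.warmup_steps


def decoding_candidates():
    """Yield (method, value) pairs, searching each method on its own grid."""
    for method, values in DECODING_GRID.items():
        for value in values:
            yield method, value


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        print(name, cfg.effective_batch_size, warmup_lr(cfg.warmup_steps // 2, cfg))
    print(len(list(decoding_candidates())), "decoding settings to evaluate")
```

Under these assumptions, Clotho's gradient accumulation gives an effective batch size of 48 × 2 = 96 per update, and searching each decoding method separately means 5 + 3 + 5 = 13 settings rather than a full cross product.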