SALMONN: Towards Generic Hearing Abilities for Large Language Models
Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SALMONN as a single model can perform all these tasks and achieve competitive performance on standard benchmarks, which reveals the feasibility of building artificial intelligence (AI) that can hear and understand general audio inputs consisting of mixtures of speech, audio events, and music. |
| Researcher Affiliation | Collaboration | Changli Tang1, Wenyi Yu1, Guangzhi Sun1, Xianzhao Chen2, Tian Tan2, Wei Li2, Lu Lu2, Zejun Ma2, Chao Zhang1; Department of Electronic Engineering, Tsinghua University1; ByteDance2 |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN. |
| Open Datasets | Yes | The data used for the first pre-training stage consists of both the 960-hour LibriSpeech training set (Panayotov et al., 2015) and the 1000-hour GigaSpeech M-set (Chen et al., 2021) for speech recognition, as well as the 2800-hour WavCaps (Mei et al., 2023) (with audio clips longer than 180 seconds removed), AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020) datasets for audio captioning. |
| Dataset Splits | No | The paper lists training and testing datasets but does not provide specific details on validation dataset splits, percentages, or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory specifications) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions several models and techniques (e.g., Whisper, BEATs, Vicuna, LoRA) but does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, TensorFlow version, CUDA version). |
| Experiment Setup | Yes | For the window-level Q-Former, we use N = 1, resulting in only one trainable query, and use L = 17, which is approximately 0.33 seconds per window. This leads to 88 textual tokens output by the Q-Former for a 30-second audio. Regarding the hyper-parameters of LoRA (Hu et al., 2022), we set the rank to 8 and the scaling factor to 4.0. (A hedged sketch of these settings follows the table.) |
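
The quoted experiment setup fixes only a handful of hyper-parameters (N = 1 query per window, L = 17 frames per window, LoRA rank 8 with scaling factor 4.0, and 88 audio tokens per 30-second clip). The sketch below, which is not the authors' code, illustrates how those numbers relate; the 50 Hz encoder frame rate and the alpha = rank × scaling convention are assumptions, and only the values quoted above come from the paper.

```python
# Minimal sketch (assumptions noted inline; not the SALMONN implementation).

FRAMES_PER_SECOND = 50     # assumed Whisper-style encoder output rate
AUDIO_SECONDS = 30         # clip length referenced in the paper
WINDOW_FRAMES = 17         # L = 17 frames, roughly 0.33 s per window
QUERIES_PER_WINDOW = 1     # N = 1 trainable query per window

total_frames = FRAMES_PER_SECOND * AUDIO_SECONDS      # 1500 frames
num_windows = total_frames // WINDOW_FRAMES           # 88 full windows
audio_tokens = num_windows * QUERIES_PER_WINDOW       # 88 textual tokens

# LoRA: rank r = 8 and scaling factor 4.0; under the usual alpha / r
# convention this implies alpha = 32 (an inference, not stated in the paper).
LORA_RANK = 8
LORA_SCALING = 4.0
lora_alpha = LORA_RANK * LORA_SCALING                 # 32.0

print(f"audio tokens per 30 s clip: {audio_tokens}")  # -> 88
print(f"implied LoRA alpha: {lora_alpha}")            # -> 32.0
```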