SALMONN: Towards Generic Hearing Abilities for Large Language Models
Authors: Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Chao Zhang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SALMONN as a single model can perform all these tasks and achieve competitive performance on standard benchmarks, which reveals the feasibility of building artificial intelligence (AI) that can hear and understand general audio inputs consisting of mixtures of speech, audio events, and music. |
| Researcher Affiliation | Collaboration | Changli Tang1, Wenyi Yu1, Guangzhi Sun1, Xianzhao Chen2, Tian Tan2, Wei Li2, Lu Lu2, Zejun Ma2, Chao Zhang1; Department of Electronic Engineering, Tsinghua University1; ByteDance2 |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code, model checkpoints and data are available at https://github.com/bytedance/SALMONN. |
| Open Datasets | Yes | The data used for the first pre-training stage consists of both the 960-hour LibriSpeech training set (Panayotov et al., 2015) and the 1000-hour GigaSpeech M-set (Chen et al., 2021) for speech recognition, as well as the 2800-hour WavCaps (Mei et al., 2023) (with audio clips longer than 180 seconds removed), AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020) datasets for audio captioning. |
| Dataset Splits | No | The paper lists training and testing datasets but does not provide specific details on validation dataset splits, percentages, or sample counts for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory specifications) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions several models and techniques (e.g., Whisper, BEATs, Vicuna, LoRA) but does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python version, PyTorch version, TensorFlow version, CUDA version). |
| Experiment Setup | Yes | For the window-level Q-Former, we use N = 1, resulting in only one trainable query, and use L = 17, which is approximately 0.33 seconds per window. This leads to 88 textual tokens output by the Q-Former for a 30-second audio. Regarding the hyper-parameters of LoRA (Hu et al., 2022), we set the rank to 8 and the scaling factor to 4.0. (A hedged sketch of these settings follows the table.) |
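
The quoted experiment setup fixes only a handful of hyper-parameters (N = 1 query per window, L = 17 frames per window, LoRA rank 8 with scaling factor 4.0, and 88 audio tokens per 30-second clip). The sketch below, which is not the authors' code, illustrates how those numbers relate; the 50 Hz encoder frame rate and the alpha = rank × scaling convention are assumptions, and only the values quoted above come from the paper.

```python
# Minimal sketch (assumptions noted inline; not the SALMONN implementation).

FRAMES_PER_SECOND = 50     # assumed Whisper-style encoder output rate
AUDIO_SECONDS = 30         # clip length referenced in the paper
WINDOW_FRAMES = 17         # L = 17 frames, roughly 0.33 s per window
QUERIES_PER_WINDOW = 1     # N = 1 trainable query per window

total_frames = FRAMES_PER_SECOND * AUDIO_SECONDS      # 1500 frames
num_windows = total_frames // WINDOW_FRAMES           # 88 full windows
audio_tokens = num_windows * QUERIES_PER_WINDOW       # 88 textual tokens

# LoRA: rank r = 8 and scaling factor 4.0; under the usual alpha / r
# convention this implies alpha = 32 (an inference, not stated in the paper).
LORA_RANK = 8
LORA_SCALING = 4.0
lora_alpha = LORA_RANK * LORA_SCALING                 # 32.0

print(f"audio tokens per 30 s clip: {audio_tokens}")  # -> 88
print(f"implied LoRA alpha: {lora_alpha}")            # -> 32.0
```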