Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Discrete Audio Tokens: More Than a Survey!
Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
TMLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches... We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. |
| Researcher Affiliation | Collaboration | 1Concordia University, 2Mila-Quebec AI Institute, 3The Hebrew University of Jerusalem, 4University of Cambridge, 5Indiana University, 6Carnegie Mellon University, 7Microsoft, 8Université de Montréal, 9Université de Toulon, 10Google, 11Apple, 12Laval University, 13National Taiwan University, 14University of Illinois at Urbana-Champaign |
| Pseudocode | Yes | Algorithm 1: Residual Vector Quantization (RVQ). Input: embedding z_t, codebooks {C^(m)}, m = 1..M. Initialize residual: r_t^(1) ← z_t. For m = 1 to M: q_t^(m) ← argmin_k ‖r_t^(m) − c_k^(m)‖²; ẑ_t^(m) ← c^(m)_{q_t^(m)}; r_t^(m+1) ← r_t^(m) − ẑ_t^(m). End for. Output: ẑ_t = Σ_{m=1}^{M} ẑ_t^(m). |
| Open Source Code | No | The paper provides a link to a website (https://poonehmousavi.github.io/dates-website/) for "main results and tokenizer database" and to Hugging Face (https://huggingface.co/collections/espnet/codec-survey-pre-trained-models-67ce8e09568b741d1c4483c8) for "released model checkpoints". Neither of these explicitly states the release of source code for the methodology described in the paper. |
| Open Datasets | Yes | For speech evaluation, we use the LibriSpeech test-clean set (Panayotov et al., 2015). For music, we use the MUSDB dataset (Rafii et al., 2017), which consists of approximately 10 hours of full-length and professionally recorded musical tracks at 44.1 kHz. Lastly, for general audio we opt for the AudioSet (Gemmeke et al., 2017) test set, which accounts for approximately 55 hours of audio clips extracted from YouTube. |
| Dataset Splits | Yes | For speech evaluation, we use the LibriSpeech test-clean set (Panayotov et al., 2015). For music, we use the MUSDB dataset (Rafii et al., 2017)... For general audio we opt for the AudioSet (Gemmeke et al., 2017) test set... We follow Kreuk et al. (2023) and use the official splits of AudioCaps for validation and testing... For training, we use the genre-balanced Free Music Archive (FMA) dataset (Defferrard et al., 2017), following the setup of stable-audio-open (Evans et al., 2025). All samples are 30 seconds long, and we follow the official split provided in the dataset repository. The training set consists of 84,213 samples, totaling 702 hours. |
| Hardware Specification | Yes | Table 18 shows the computational settings for our experiments. Downstream Evaluation: 1 A100 (80GB), 48 hrs; Reconstructed Audio Quality: 1 A6000 (48GB), 24 hrs; Speech Language Modeling: 2 A100 (40GB), 48 hrs; Text-to-Speech: 1 A100 (80GB), 96 hrs; Audio Generation: 2 A100 (80GB), 48 hrs; Music Generation: 4 A100 (40GB), 48 hrs; Ablation Studies: 2 GH200 (100GB), 48 hrs. |
| Software Dependencies | No | The paper mentions software such as the "SpeechBrain toolkit" and "Hugging Face Transformers" but does not provide specific version numbers for them, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | Motivated by Maimon et al. (2025a), each SLM is built upon the Qwen-2.5 architecture (Yang et al., 2024a) (357M parameters in total, after removing the text embedding tables) and initialized using TWIST (Hassid et al., 2023)... The models are trained for a total of 50,000 optimizer steps, with a context length set to 1024. The audio target batch size is set to include about 2.9 hours of speech per backpropagation step. We used the Adam optimizer coupled with a linear learning rate scheduler, applying a 1% warmup ratio (corresponding to 500 steps)... We first perform 10 epochs of AR-only training, followed by 90 epochs of joint training with both AR and NAR layers to improve convergence for some tokenizers. All models use a 12-layer architecture for both AR and NAR decoders, with an attention dimension of 1024 and a dropout rate of 0.2. |
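The RVQ procedure quoted in the Pseudocode row above can be sketched in a few lines. This is a minimal NumPy illustration of the greedy stage-by-stage quantization, not the paper's implementation; the function name `rvq_encode` and the array layout (one `(K, d)` codebook per stage) are assumptions made for this sketch.

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Greedy residual vector quantization of a single embedding.

    z: (d,) embedding vector
    codebooks: list of (K, d) arrays, one codebook per stage m = 1..M
    Returns (indices, z_hat): per-stage codeword indices and the
    reconstruction, i.e. the sum of the selected codewords.
    """
    residual = z.astype(float).copy()      # r^(1) <- z_t
    indices = []
    z_hat = np.zeros_like(residual)
    for C in codebooks:                    # stages m = 1..M
        # q^(m) <- argmin_k || r^(m) - c_k^(m) ||^2
        k = int(np.argmin(((residual - C) ** 2).sum(axis=1)))
        indices.append(k)
        z_hat += C[k]                      # accumulate selected codeword
        residual -= C[k]                   # r^(m+1) <- r^(m) - c^(m)_q
    return indices, z_hat
```

Each stage quantizes only what the previous stages failed to capture, which is why more codebooks generally mean finer reconstruction at the cost of a higher bitrate (the trade-off the survey's ablations examine).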