BEATs: Audio Pre-Training with Acoustic Tokenizers
Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, Furu Wei
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art (SOTA) results across various audio classification benchmarks. |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, Harbin, Heilongjiang, China 2Microsoft Research Asia, Beijing, China 3Nankai University, Tianjin, China 4Microsoft Corporation, Redmond, WA, USA. |
| Pseudocode | No | The paper describes the methodology in text and figures but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and pre-trained models are available at https://aka.ms/beats. |
| Open Datasets | Yes | We pre-train our BEATs tokenizers and audio SSL models on the full training set of the AudioSet dataset (Gemmeke et al., 2017), and evaluate our pre-trained audio SSL models on six downstream tasks, including three audio classification tasks (AS-2M, AS-20K (Gemmeke et al., 2017) and ESC-50 (Piczak, 2015)) and three speech classification tasks (KS1, KS2 (Warden, 2018) and ER (Busso et al., 2008)). |
| Dataset Splits | Yes | Speech Commands V2 (KS2) (Warden, 2018) is a keyword spotting dataset that contains 105,829 1-second spoken word clips annotated with 35 common word classes. It is officially subdivided into training, validation, and testing sets that contain 84,843, 9,981, and 11,005 audio clips, respectively. ... Environmental Sound Classification (ESC-50) (Piczak, 2015)... We follow the 5-fold cross-validation evaluation setting as in previous works... ...IEMOCAP (ER) (Busso et al., 2008)... we use the 5-fold cross-validation evaluation setting of the SUPERB benchmark (wen Yang et al., 2021). |
| Hardware Specification | Yes | Each of the BEATs models is trained with 16 Tesla V100-SXM2-32GB GPUs for around 75 hours, and the self-distilled tokenizer is trained with 8 Tesla V100-SXM2-32GB GPUs for around 45 hours. |
| Software Dependencies | No | Table 4 lists hyperparameters such as 'Optimizer AdamW' and mentions augmentation methods like 'SpecAug (Park et al., 2019)' and 'Mixup (Zhang et al., 2017)', but it does not specify software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | Table 4 shows the detailed hyperparameters used for BEATs acoustic tokenizer training, audio SSL model pre-training, and fine-tuning, which are adapted from previous works (Xu et al., 2022; Chen et al., 2022b; Peng et al., 2022). It lists: Optimizer AdamW, Weight decay 0.01, Learning rate schedule Linear decay, Steps 400K, Warmup steps 32K, Batch size 5.6K seconds, Peak learning rate 5e-4, Dropout 0.1, SpecAug 0.3, Mixup 0.8. |
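
The 5-fold evaluations quoted in the Dataset Splits row use predefined folds rather than random splits (ESC-50 ships a fold assignment with its metadata, and the SUPERB ER setting holds out one IEMOCAP session per fold). A minimal sketch of that protocol, assuming hypothetical `records` (a list of `(features, label, fold)` tuples with folds numbered 1..5) and a hypothetical `train_and_eval` callable:

```python
def cross_validate(records, train_and_eval):
    """Evaluate over predefined folds and average the per-fold scores."""
    scores = []
    for held_out in range(1, 6):
        # Train on the four folds that are not held out, test on the fifth.
        train = [(x, y) for x, y, f in records if f != held_out]
        test = [(x, y) for x, y, f in records if f == held_out]
        scores.append(train_and_eval(train, test))
    # The reported metric is the mean score across the five folds.
    return sum(scores) / len(scores)
```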
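
The Table 4 values quoted in the Experiment Setup row map directly onto a standard PyTorch optimizer and learning-rate schedule. A minimal sketch, assuming a stand-in model (the real BEATs encoder is a ViT-style Transformer; only the quoted step counts, peak learning rate, and weight decay come from the paper):

```python
import torch

# Stand-in model so the snippet runs; not the BEATs architecture.
model = torch.nn.Linear(128, 527)

# Values quoted from Table 4 of the paper.
total_steps = 400_000
warmup_steps = 32_000

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01)

def linear_warmup_then_decay(step: int) -> float:
    # Scale factor applied to the peak learning rate at each step:
    # linear ramp-up over the warmup steps, then linear decay to zero.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=linear_warmup_then_decay
)

# In the training loop: loss.backward(); optimizer.step(); scheduler.step()
```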
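
Similarly, "Mixup 0.8" in Table 4 refers to the Beta-distribution parameter of mixup (Zhang et al., 2017), which convex-combines pairs of examples and their labels. A hedged sketch of how this augmentation is typically applied to spectrogram batches; the tensor shapes and multi-hot 527-class AudioSet targets are illustrative assumptions:

```python
import torch

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.8):
    """Mix each example with a randomly permuted partner from the batch.

    x: batch of log-mel spectrograms, shape (B, T, F)
    y: multi-hot label targets, shape (B, C)
    alpha: Beta-distribution parameter; Table 4 reports 0.8.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[perm]
    mixed_y = lam * y + (1.0 - lam) * y[perm]
    return mixed_x, mixed_y

# Example: a batch of 4 clips (100 frames, 128 mel bins) with 527 classes.
x = torch.randn(4, 100, 128)
y = torch.zeros(4, 527)
mx, my = mixup(x, y)
```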