BEATs: Audio Pre-Training with Acoustic Tokenizers

Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, Furu Wei

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art (SOTA) results across various audio classification benchmarks.
Researcher Affiliation | Collaboration | (1) Harbin Institute of Technology, Harbin, Heilongjiang, China; (2) Microsoft Research Asia, Beijing, China; (3) Nankai University, Tianjin, China; (4) Microsoft Corporation, Redmond, WA, USA.
Pseudocode | No | The paper describes the methodology in text and figures but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are available at https://aka.ms/beats.
Open Datasets | Yes | We pre-train our BEATs tokenizers and audio SSL models on the full training set of the Audio Set dataset (Gemmeke et al., 2017), and evaluate our pre-trained audio SSL models on six downstream tasks, including three audio classification tasks (AS-2M, AS-20K (Gemmeke et al., 2017), and ESC-50 (Piczak, 2015)) and three speech classification tasks (KS1, KS2 (Warden, 2018), and ER (Busso et al., 2008)).
Dataset Splits | Yes | Speech Commands V2 (KS2) (Warden, 2018) is a keyword spotting dataset that contains 105,829 1-second spoken word clips annotated with 35 common word classes. It is officially subdivided into training, validation, and testing sets that contain 84,843, 9,981, and 11,005 audio clips respectively. ... Environmental Sound Classification (ESC-50) (Piczak, 2015)... We follow the 5-fold cross-validation evaluation setting as in previous works... ...IEMOCAP (ER) (Busso et al., 2008)... we use the 5-fold cross-validation evaluation setting of the SUPERB benchmark (wen Yang et al., 2021). (A minimal 5-fold evaluation sketch is given below the table.)
Hardware Specification | Yes | Each of the BEATs models is trained with 16 Tesla V100-SXM2-32GB GPUs for around 75 hours, and the self-distilled tokenizer is trained with 8 Tesla V100-SXM2-32GB GPUs for around 45 hours.
Software Dependencies | No | Table 4 lists hyperparameters such as 'Optimizer AdamW' and mentions augmentation methods like 'SpecAug (Park et al., 2019)' and 'Mixup (Zhang et al., 2017)', but it does not specify software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | Table 4 shows the detailed hyperparameters used for BEATs acoustic tokenizer training, audio SSL model pre-training, and fine-tuning, which are adapted from previous works (Xu et al., 2022; Chen et al., 2022b; Peng et al., 2022). It lists: optimizer AdamW, weight decay 0.01, linear-decay learning rate schedule, 400K training steps, 32K warmup steps, batch size 5.6K seconds, peak learning rate 5e-4, dropout 0.1, SpecAug 0.3, and Mixup 0.8. (A hedged optimizer/schedule sketch follows the table.)
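
The 5-fold cross-validation protocol referenced in the Dataset Splits row can be summarized in a few lines. The following is a minimal Python sketch, not the authors' evaluation code: it assumes the benchmark ships predefined fold labels (as ESC-50 does), and `train_and_score` is a hypothetical user-supplied callable that trains on one split and returns test accuracy.

```python
import numpy as np

def five_fold_eval(clips, fold_ids, train_and_score):
    """Train on four folds, evaluate on the held-out fold, and average.

    clips: list of audio examples (paths, arrays, ...).
    fold_ids: predefined fold label (1..5) for each clip.
    train_and_score: hypothetical callable (train, test) -> accuracy.
    """
    fold_ids = np.asarray(fold_ids)
    accuracies = []
    for held_out in sorted(set(fold_ids.tolist())):
        train = [c for c, f in zip(clips, fold_ids) if f != held_out]
        test = [c for c, f in zip(clips, fold_ids) if f == held_out]
        accuracies.append(train_and_score(train, test))
    # The reported metric is the mean accuracy over the five held-out folds.
    return float(np.mean(accuracies))
```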
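For the Experiment Setup row, the reported optimization hyperparameters (AdamW, weight decay 0.01, peak learning rate 5e-4, 32K warmup steps, linear decay over 400K steps) map onto a standard PyTorch setup. The sketch below is only an illustration under those assumptions, not the authors' training code: the `torch.nn.Linear` placeholder model, the 527-class output (the Audio Set label count), and the omission of dropout placement, SpecAug, and Mixup are simplifications.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 400_000   # reported training steps
WARMUP_STEPS = 32_000   # reported warmup steps
PEAK_LR = 5e-4          # reported peak learning rate

# Placeholder model; BEATs itself uses a ViT-style Transformer encoder.
model = torch.nn.Linear(768, 527)

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor applied to PEAK_LR: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Training-loop skeleton: compute the loss, then
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```

The SpecAug 0.3 and Mixup 0.8 entries are data- and label-space augmentations applied to the inputs before the forward pass; they are independent of the optimizer and scheduler shown here.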