BEATs: Audio Pre-Training with Acoustic Tokenizers

Authors: Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, Furu Wei

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art (SOTA) results across various audio classification benchmarks.
Researcher Affiliation | Collaboration | (1) Harbin Institute of Technology, Harbin, Heilongjiang, China; (2) Microsoft Research Asia, Beijing, China; (3) Nankai University, Tianjin, China; (4) Microsoft Corporation, Redmond, WA, USA.
Pseudocode | No | The paper describes the methodology in text and figures but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are available at https://aka.ms/beats.
Open Datasets | Yes | We pre-train our BEATs tokenizers and audio SSL models on the full training set of the Audio Set dataset (Gemmeke et al., 2017), and evaluate our pre-trained audio SSL models on six downstream tasks, including three audio classification tasks (AS-2M, AS-20K (Gemmeke et al., 2017), and ESC-50 (Piczak, 2015)) and three speech classification tasks (KS1, KS2 (Warden, 2018), and ER (Busso et al., 2008)).
Dataset Splits | Yes | Speech Commands V2 (KS2) (Warden, 2018) is a keyword spotting dataset that contains 105,829 1-second spoken word clips annotated with 35 common word classes. It is officially subdivided into training, validation, and testing sets that contain 84,843, 9,981, and 11,005 audio clips respectively. ... Environmental Sound Classification (ESC-50) (Piczak, 2015)... We follow the 5-fold cross-validation evaluation setting as in previous works... ...IEMOCAP (ER) (Busso et al., 2008)... we use the 5-fold cross-validation evaluation setting of the SUPERB benchmark (wen Yang et al., 2021). (A minimal 5-fold evaluation sketch is given below the table.)
Hardware Specification | Yes | Each of the BEATs models is trained with 16 Tesla V100-SXM2-32GB GPUs for around 75 hours, and the self-distilled tokenizer is trained with 8 Tesla V100-SXM2-32GB GPUs for around 45 hours.
Software Dependencies | No | Table 4 lists hyperparameters such as 'Optimizer AdamW' and mentions augmentation methods like 'SpecAug (Park et al., 2019)' and 'Mixup (Zhang et al., 2017)', but it does not specify software names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | Table 4 shows the detailed hyperparameters used for BEATs acoustic tokenizer training, audio SSL model pre-training, and fine-tuning, which are adapted from previous works (Xu et al., 2022; Chen et al., 2022b; Peng et al., 2022). It lists: optimizer AdamW, weight decay 0.01, linear-decay learning rate schedule, 400K training steps, 32K warmup steps, batch size 5.6K seconds, peak learning rate 5e-4, dropout 0.1, SpecAug 0.3, and Mixup 0.8. (A hedged optimizer/schedule sketch follows the table.)
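
The 5-fold cross-validation protocol referenced in the Dataset Splits row can be summarized in a few lines. The following is a minimal Python sketch, not the authors' evaluation code: it assumes the benchmark ships predefined fold labels (as ESC-50 does), and `train_and_score` is a hypothetical user-supplied callable that trains on one split and returns test accuracy.

```python
import numpy as np

def five_fold_eval(clips, fold_ids, train_and_score):
    """Train on four folds, evaluate on the held-out fold, and average.

    clips: list of audio examples (paths, arrays, ...).
    fold_ids: predefined fold label (1..5) for each clip.
    train_and_score: hypothetical callable (train, test) -> accuracy.
    """
    fold_ids = np.asarray(fold_ids)
    accuracies = []
    for held_out in sorted(set(fold_ids.tolist())):
        train = [c for c, f in zip(clips, fold_ids) if f != held_out]
        test = [c for c, f in zip(clips, fold_ids) if f == held_out]
        accuracies.append(train_and_score(train, test))
    # The reported metric is the mean accuracy over the five held-out folds.
    return float(np.mean(accuracies))
```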
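For the Experiment Setup row, the reported optimization hyperparameters (AdamW, weight decay 0.01, peak learning rate 5e-4, 32K warmup steps, linear decay over 400K steps) map onto a standard PyTorch setup. The sketch below is only an illustration under those assumptions, not the authors' training code: the `torch.nn.Linear` placeholder model, the 527-class output (the Audio Set label count), and the omission of dropout placement, SpecAug, and Mixup are simplifications.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

TOTAL_STEPS = 400_000   # reported training steps
WARMUP_STEPS = 32_000   # reported warmup steps
PEAK_LR = 5e-4          # reported peak learning rate

# Placeholder model; BEATs itself uses a ViT-style Transformer encoder.
model = torch.nn.Linear(768, 527)

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.01)

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor applied to PEAK_LR: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Training-loop skeleton: compute the loss, then
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```

The SpecAug 0.3 and Mixup 0.8 entries are data- and label-space augmentations applied to the inputs before the forward pass; they are independent of the optimizer and scheduler shown here.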