EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup up to 15x compared to existing audio SSL models.
Researcher Affiliation | Academia | MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; {1029713857, chenxie95}@sjtu.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are available at https://github.com/cwx-worst-one/EAT.
Open Datasets | Yes | We pre-trained EAT on the AudioSet-2M (AS-2M) dataset [Gemmeke et al., 2017], evaluating its performance through audio-classification fine-tuning on AS-2M, AS-20K, and the Environmental Sound Classification (ESC-50) [Piczak, 2015] datasets, as well as speech-classification fine-tuning on the Speech Commands V2 (SPC-2) [Warden, 2018] dataset.
Dataset Splits | Yes | For ESC-50: In our evaluation, we implemented a five-fold cross-validation method, using 400 clips for validation and the remaining for training in each fold. For SPC-2: It includes 84,843 training recordings, 9,981 validation recordings, and 11,005 testing recordings, each lasting 1 second. (See the cross-validation sketch after the table.)
Hardware Specification | Yes | We distribute the training load over 4 RTX 3090 GPUs and the total training time is around 58 hours.
Software Dependencies | No | The paper mentions using the Adam optimizer, cosine annealing, and Fairseq, but does not provide specific version numbers for these software components or for other libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | EAT is pre-trained using AS-2M for 10 epochs with a batch size of 12 and a peak learning rate of 0.0005. For each clip, we created 16 clones with different inverse block masks via the multi-mask method. The cosine annealing learning strategy with warm-up steps [Loshchilov and Hutter, 2016] was employed, alongside the Adam optimizer [Loshchilov and Hutter, 2017], with β1 and β2 values set to 0.9 and 0.95, respectively.
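
For orientation, the hyperparameters quoted in the Experiment Setup row can be collected into a short PyTorch-style sketch. This is a minimal illustration under stated assumptions: the placeholder model, the warm-up length, and the total step count are hypothetical, and the sketch is not the authors' Fairseq implementation.

```python
# Minimal sketch of the quoted pre-training hyperparameters.
# The stand-in model, warm-up length, and total step count are assumptions,
# not values taken from the paper or the EAT repository.
import math
import torch

PEAK_LR = 5e-4            # peak learning rate (paper)
BETAS = (0.9, 0.95)       # Adam beta1 / beta2 (paper)
EPOCHS = 10               # pre-training epochs on AS-2M (paper)
BATCH_SIZE = 12           # batch size (paper)
NUM_CLONES = 16           # multi-mask clones per clip with inverse block masks (paper)
WARMUP_STEPS = 10_000     # assumed warm-up length; not stated in the quote
TOTAL_STEPS = 150_000     # assumed total step count, for illustration only

model = torch.nn.Linear(512, 512)  # placeholder for the EAT encoder/decoder
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=BETAS)

def cosine_lr_with_warmup(step: int) -> float:
    """Linear warm-up to the peak rate, then cosine annealing toward zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# In a training loop, the rate would be applied per optimization step:
for step in range(3):  # a few steps, for illustration only
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr_with_warmup(step)
```

The multi-mask setting (16 clones per clip, each with a different inverse block mask) appears here only as a constant; the masking procedure itself is described in the paper.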
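
Similarly, the ESC-50 evaluation protocol quoted in the Dataset Splits row (five-fold cross-validation with 400 clips held out per fold) can be sketched as follows. The official ESC-50 folds are predefined, so the random split and the commented-out train_and_evaluate call below are illustrative placeholders only.

```python
# Illustrative sketch of five-fold evaluation over the 2,000 ESC-50 clips,
# holding out 400 clips per fold. The random split and the commented-out
# train_and_evaluate routine are placeholders, not the official protocol.
import numpy as np
from sklearn.model_selection import KFold

clip_indices = np.arange(2000)                     # ESC-50 contains 2,000 clips
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(clip_indices)):
    assert len(val_idx) == 400                     # 400 clips for validation in each fold
    # fold_scores.append(train_and_evaluate(train_idx, val_idx))  # hypothetical routine
    print(f"fold {fold}: {len(train_idx)} train clips, {len(val_idx)} validation clips")

if fold_scores:
    print("mean accuracy over folds:", float(np.mean(fold_scores)))
```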