EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup up to 15x compared to existing audio SSL models.
Researcher Affiliation | Academia | MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; {1029713857, chenxie95}@sjtu.edu.cn
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code and pre-trained models are available at https://github.com/cwx-worst-one/EAT.
Open Datasets | Yes | We pre-trained EAT on the AudioSet-2M (AS-2M) dataset [Gemmeke et al., 2017], evaluating its performance through audio-classification fine-tuning on AS-2M, AS-20K, and the Environmental Sound Classification (ESC-50) [Piczak, 2015] datasets, as well as speech-classification fine-tuning on the Speech Commands V2 (SPC-2) [Warden, 2018] dataset.
Dataset Splits | Yes | For ESC-50: In our evaluation, we implemented a five-fold cross-validation method, using 400 clips for validation and the remaining for training in each fold. For SPC-2: It includes 84,843 training recordings, 9,981 validation recordings, and 11,005 testing recordings, each lasting 1 second. (See the cross-validation sketch after the table.)
Hardware Specification | Yes | We distribute the training load over 4 RTX 3090 GPUs and the total training time is around 58 hours.
Software Dependencies | No | The paper mentions using the Adam optimizer, cosine annealing, and Fairseq, but does not provide specific version numbers for these software components or for other libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | EAT is pre-trained using AS-2M for 10 epochs with a batch size of 12 and a peak learning rate of 0.0005. For each clip, we created 16 clones with different inverse block masks via the multi-mask method. The cosine annealing learning strategy with warm-up steps [Loshchilov and Hutter, 2016] was employed, alongside the Adam optimizer [Loshchilov and Hutter, 2017], with β1 and β2 values set to 0.9 and 0.95, respectively.
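
For orientation, the hyperparameters quoted in the Experiment Setup row can be collected into a short PyTorch-style sketch. This is a minimal illustration under stated assumptions: the placeholder model, the warm-up length, and the total step count are hypothetical, and the sketch is not the authors' Fairseq implementation.

```python
# Minimal sketch of the quoted pre-training hyperparameters.
# The stand-in model, warm-up length, and total step count are assumptions,
# not values taken from the paper or the EAT repository.
import math
import torch

PEAK_LR = 5e-4            # peak learning rate (paper)
BETAS = (0.9, 0.95)       # Adam beta1 / beta2 (paper)
EPOCHS = 10               # pre-training epochs on AS-2M (paper)
BATCH_SIZE = 12           # batch size (paper)
NUM_CLONES = 16           # multi-mask clones per clip with inverse block masks (paper)
WARMUP_STEPS = 10_000     # assumed warm-up length; not stated in the quote
TOTAL_STEPS = 150_000     # assumed total step count, for illustration only

model = torch.nn.Linear(512, 512)  # placeholder for the EAT encoder/decoder
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=BETAS)

def cosine_lr_with_warmup(step: int) -> float:
    """Linear warm-up to the peak rate, then cosine annealing toward zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

# In a training loop, the rate would be applied per optimization step:
for step in range(3):  # a few steps, for illustration only
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr_with_warmup(step)
```

The multi-mask setting (16 clones per clip, each with a different inverse block mask) appears here only as a constant; the masking procedure itself is described in the paper.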
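
Similarly, the ESC-50 evaluation protocol quoted in the Dataset Splits row (five-fold cross-validation with 400 clips held out per fold) can be sketched as follows. The official ESC-50 folds are predefined, so the random split and the commented-out train_and_evaluate call below are illustrative placeholders only.

```python
# Illustrative sketch of five-fold evaluation over the 2,000 ESC-50 clips,
# holding out 400 clips per fold. The random split and the commented-out
# train_and_evaluate routine are placeholders, not the official protocol.
import numpy as np
from sklearn.model_selection import KFold

clip_indices = np.arange(2000)                     # ESC-50 contains 2,000 clips
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(clip_indices)):
    assert len(val_idx) == 400                     # 400 clips for validation in each fold
    # fold_scores.append(train_and_evaluate(train_idx, val_idx))  # hypothetical routine
    print(f"fold {fold}: {len(train_idx)} train clips, {len(val_idx)} validation clips")

if fold_scores:
    print("mean accuracy over folds:", float(np.mean(fold_scores)))
```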