EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiment results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including Audio Set (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to 15x compared to existing audio SSL models. |
| Researcher Affiliation | Academia | MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China {1029713857, chenxie95}@sjtu.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and pre-trained models are available at https://github.com/cwx-worst-one/EAT. |
| Open Datasets | Yes | We pre-trained EAT on the Audio Set-2M (AS-2M) dataset [Gemmeke et al., 2017], evaluating its performance through audio-classification fine-tuning on AS-2M, AS-20K, and the Environmental Sound Classification (ESC-50) [Piczak, 2015] datasets, as well as speech-classification fine-tuning on the Speech Commands V2 (SPC-2) [Warden, 2018] dataset. |
| Dataset Splits | Yes | In our evaluation [on ESC-50], we implemented a five-fold cross-validation method, using 400 clips for validation and the remaining for training in each fold. [SPC-2] includes 84,843 training recordings, 9,981 validation recordings, and 11,005 testing recordings, each lasting 1 second. |
| Hardware Specification | Yes | We distribute the training load over 4 RTX 3090 GPUs and the total training time is around 58 hours. |
| Software Dependencies | No | The paper mentions using the Adam optimizer, cosine annealing, and Fairseq, but does not provide version numbers for these components or for the underlying stack (e.g., Python, PyTorch, CUDA). |
| Experiment Setup | Yes | EAT is pre-trained using AS-2M for 10 epochs with a batch size of 12 and a peak learning rate of 0.0005. For each clip, we created 16 clones with different inverse block masks via the multi-mask method. The cosine annealing learning strategy with warm-up steps [Loshchilov and Hutter, 2016] was employed, alongside the Adam optimizer [Loshchilov and Hutter, 2017], with β1 and β2 values set to 0.9 and 0.95, respectively. (A hedged configuration sketch of this recipe follows the table.) |
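
The optimization recipe quoted in the Experiment Setup row can be expressed as a short PyTorch sketch. This is a hypothetical reconstruction, not the authors' Fairseq configuration: the warm-up length (`WARMUP_STEPS`), the total step count (`TOTAL_STEPS`), and the stand-in `model` are assumptions introduced here; only the peak learning rate (5e-4), Adam betas (0.9, 0.95), batch size (12), and the 16 mask clones per clip come from the quoted setup.

```python
# Hypothetical sketch (not the authors' code): reconstructing the stated
# pre-training optimization hyperparameters with plain PyTorch.
import math
import torch

PEAK_LR = 5e-4          # peak learning rate reported in the paper
BETAS = (0.9, 0.95)     # Adam beta1 / beta2 reported in the paper
BATCH_SIZE = 12         # clips per batch, as reported
NUM_CLONES = 16         # inverse-block-mask clones per clip (multi-mask), as reported
WARMUP_STEPS = 10_000   # assumption: the quoted setup does not state the warm-up length
TOTAL_STEPS = 200_000   # assumption: placeholder for 10 epochs over AS-2M

model = torch.nn.Linear(768, 768)  # placeholder stand-in for the EAT encoder/decoder

optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=BETAS)

def lr_lambda(step: int) -> float:
    """Linear warm-up to the peak LR, then cosine annealing toward zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

In the multi-mask scheme described above, each of the 12 clips in a batch would additionally be expanded into 16 differently masked views before being fed to the encoder; the sketch covers only the optimizer and learning-rate schedule, not the masking or the model itself.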