SSAST: Self-Supervised Audio Spectrogram Transformer

Authors: Yuan Gong, Cheng-I Lai, Yu-An Chung, James Glass (pp. 10699-10709)

Venue: AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST."
Researcher Affiliation | Academia | "MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, {yuangong, clai24, andyyuan, glass}@mit.edu"
Pseudocode | Yes | "Algorithm 1: Joint Discriminative and Generative Masked Spectrogram Patch Modeling" (an illustrative code sketch of this objective follows the table)
Open Source Code | Yes | "Code and models at https://github.com/YuanGongND/ssast"
Open Datasets | Yes | "We use the AudioSet full training set (AudioSet-2M) (Gemmeke et al. 2017) as our audio pretraining dataset. For both datasets, we only use the audio data and abandon the labels for self-supervised pretraining."
Dataset Splits | Yes | "For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations."
Hardware Specification | Yes | "We pretrain SSAST on 4 NVIDIA GTX Titan X or GTX Titan X Pascal GPUs, the pretraining takes about 10 days."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, and implicitly a deep learning framework, but it does not specify software dependencies with version numbers.
Experiment Setup | Yes | "For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations. We optimize the network using the Adam optimizer (Kingma and Ba 2015). We train the model for up to 800k iterations (≈ 8.5 epochs)." (see the optimization sketch after the table)
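To make the pretext task named in the Pseudocode row concrete, below is a minimal PyTorch-style sketch of a joint discriminative (InfoNCE over masked patches) and generative (MSE reconstruction) masked spectrogram patch modeling loss. This is not the authors' implementation (their code is at https://github.com/YuanGongND/ssast); the encoder, the two heads, the mask token, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def mspm_loss(encoder, cls_head, rec_head, mask_token, patches, mask_idx,
              loss_weight=10.0):
    """Sketch of joint discriminative + generative masked patch modeling.

    patches:  (B, N, D) flattened spectrogram patches
    mask_idx: (B, M) long tensor of masked patch positions
    loss_weight: illustrative weight between the two terms (an assumption)
    """
    B, N, D = patches.shape
    M = mask_idx.size(1)
    idx = mask_idx.unsqueeze(-1)                                     # (B, M, 1)

    # Ground-truth content of the masked patches.
    targets = torch.gather(patches, 1, idx.expand(-1, -1, D))        # (B, M, D)

    # Replace masked patches with a learnable mask embedding before encoding.
    corrupted = patches.scatter(1, idx.expand(-1, -1, D),
                                mask_token.expand(B, M, D))

    hidden = encoder(corrupted)                                      # (B, N, H)
    H = hidden.size(-1)
    hidden = torch.gather(hidden, 1, idx.expand(-1, -1, H))          # (B, M, H)

    # Discriminative branch: InfoNCE-style matching of each masked position
    # against all masked patches of the same clip.
    logits = torch.bmm(cls_head(hidden), targets.transpose(1, 2))    # (B, M, M)
    labels = torch.arange(M, device=patches.device).expand(B, M)
    loss_d = F.cross_entropy(logits.reshape(B * M, M), labels.reshape(B * M))

    # Generative branch: reconstruct the masked patch content with MSE.
    loss_g = F.mse_loss(rec_head(hidden), targets)

    return loss_d + loss_weight * loss_g
```

The two branches are complementary: the discriminative term forces the encoder to tell masked patches apart, while the generative term forces it to recover their content; Algorithm 1 in the paper combines both objectives during pretraining.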
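The quoted experiment setup (Adam, initial learning rate 1e-4, learning rate halved when the validation pretext metric stops improving, up to 800k iterations at batch size 24) maps naturally onto a standard PyTorch optimizer plus a ReduceLROnPlateau schedule. The sketch below is an assumption-laden illustration: the placeholder model and the 4k-iteration validation interval are not from the paper; only the learning rate, the halving factor, and the 8k-iteration plateau criterion come from the quote.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the AST encoder (an assumption, not the
# authors' model definition).
model = nn.Linear(16 * 16, 768)

# Adam with the quoted initial learning rate of 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# "Cut the learning rate into half if the pretext task performance on the
# validation set stops improving for 8k iterations", expressed with
# ReduceLROnPlateau. Assuming validation runs every 4k iterations (this
# interval is an assumption), patience=2 checks matches the quoted 8k.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

# Training would use batch size 24 and run for up to 800k iterations;
# after each validation pass, call: scheduler.step(val_pretext_metric)
```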