SSAST: Self-Supervised Audio Spectrogram Transformer
Authors: Yuan Gong, Cheng-I Lai, Yu-An Chung, James Glass
AAAI 2022, pp. 10699-10709 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST. |
| Researcher Affiliation | Academia | MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139 {yuangong, clai24, andyyuan, glass}@mit.edu |
| Pseudocode | Yes | Algorithm 1: Joint Discriminative and Generative Masked Spectrogram Patch Modeling |
| Open Source Code | Yes | Code and models at https://github.com/YuanGongND/ssast |
| Open Datasets | Yes | We use the AudioSet full training set (AudioSet-2M) (Gemmeke et al. 2017) as our audio pretraining dataset. For both datasets, we only use the audio data and abandon the labels for self-supervised pretraining. |
| Dataset Splits | Yes | For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations. |
| Hardware Specification | Yes | We pretrain SSAST on 4 NVIDIA GTX Titan X or GTX Titan X Pascal GPUs, the pretraining takes about 10 days. |
| Software Dependencies | No | The paper mentions software components such as the Adam optimizer, and implicitly a deep learning framework, but does not specify software dependencies with version numbers. |
| Experiment Setup | Yes | For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations. We optimize the network using the Adam optimizer (Kingma and Ba 2015). We train the model for up to 800k iterations (≈ 8.5 epochs). |
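
Because the Experiment Setup row pins down the optimizer, batch size, learning-rate schedule, and iteration budget, it can be rendered as a short training-loop sketch. The PyTorch snippet below is a minimal illustration of that recipe only, under assumed placeholders for the model, the loss, and the validation interval; it is not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholder stand-in for the AST encoder plus its pretext head
# (the real model is the transformer released in the linked repository).
model = nn.Sequential(nn.Linear(16 * 16, 256), nn.GELU(), nn.Linear(256, 256))

# Adam with the initial learning rate of 1e-4 reported in the table.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# The paper halves the LR when validation pretext performance stalls for
# 8k iterations; assuming a validation check every `eval_every` iterations,
# that maps to a plateau patience of 8000 // eval_every checks.
eval_every = 1000
scheduler = ReduceLROnPlateau(optimizer, mode="max", factor=0.5,
                              patience=8000 // eval_every)

max_iters = 800_000   # the paper trains for up to 800k iterations (~8.5 epochs)
demo_iters = 20       # shortened here so the sketch finishes quickly

for it in range(1, demo_iters + 1):
    # A batch of 24 sequences of 16x16 spectrogram patches (hypothetical shapes).
    patches = torch.randn(24, 512, 16 * 16)
    loss = model(patches).pow(2).mean()   # stand-in for the joint pretext loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if it % eval_every == 0:
        val_metric = -loss.item()         # stand-in for validation pretext performance
        scheduler.step(val_metric)
```

`ReduceLROnPlateau` is used here as the closest standard PyTorch analogue of the "halve when the validation pretext metric stops improving" rule; the validation interval, and therefore the effective patience value, is an assumption rather than something stated in the paper.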