SSAST: Self-Supervised Audio Spectrogram Transformer

Authors: Yuan Gong, Cheng-I Lai, Yu-An Chung, James Glass (pp. 10699-10709)

Venue: AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our pretrained models on both audio and speech classification tasks including audio event classification, keyword spotting, emotion recognition, and speaker identification. The proposed self-supervised framework significantly boosts AST performance on all tasks, with an average improvement of 60.9%, leading to similar or even better results than a supervised pretrained AST."
Researcher Affiliation | Academia | "MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, {yuangong, clai24, andyyuan, glass}@mit.edu"
Pseudocode | Yes | "Algorithm 1: Joint Discriminative and Generative Masked Spectrogram Patch Modeling" (an illustrative code sketch of this objective follows the table)
Open Source Code | Yes | "Code and models at https://github.com/YuanGongND/ssast"
Open Datasets | Yes | "We use the AudioSet full training set (AudioSet-2M) (Gemmeke et al. 2017) as our audio pretraining dataset. For both datasets, we only use the audio data and abandon the labels for self-supervised pretraining."
Dataset Splits | Yes | "For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations."
Hardware Specification | Yes | "We pretrain SSAST on 4 NVIDIA GTX Titan X or GTX Titan X Pascal GPUs, the pretraining takes about 10 days."
Software Dependencies | No | The paper mentions software components such as the Adam optimizer, and implicitly a deep learning framework, but it does not specify software dependencies with version numbers.
Experiment Setup | Yes | "For pretraining the AST, we use a batch size of 24, an initial learning rate of 1e-4, and cut the learning rate into half if the pretext task performance on the validation set stops improving for 8k iterations. We optimize the network using the Adam optimizer (Kingma and Ba 2015). We train the model for up to 800k iterations (≈ 8.5 epochs)." (see the optimization sketch after the table)
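To make the pretext task named in the Pseudocode row concrete, below is a minimal PyTorch-style sketch of a joint discriminative (InfoNCE over masked patches) and generative (MSE reconstruction) masked spectrogram patch modeling loss. This is not the authors' implementation (their code is at https://github.com/YuanGongND/ssast); the encoder, the two heads, the mask token, and the loss weight are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def mspm_loss(encoder, cls_head, rec_head, mask_token, patches, mask_idx,
              loss_weight=10.0):
    """Sketch of joint discriminative + generative masked patch modeling.

    patches:  (B, N, D) flattened spectrogram patches
    mask_idx: (B, M) long tensor of masked patch positions
    loss_weight: illustrative weight between the two terms (an assumption)
    """
    B, N, D = patches.shape
    M = mask_idx.size(1)
    idx = mask_idx.unsqueeze(-1)                                     # (B, M, 1)

    # Ground-truth content of the masked patches.
    targets = torch.gather(patches, 1, idx.expand(-1, -1, D))        # (B, M, D)

    # Replace masked patches with a learnable mask embedding before encoding.
    corrupted = patches.scatter(1, idx.expand(-1, -1, D),
                                mask_token.expand(B, M, D))

    hidden = encoder(corrupted)                                      # (B, N, H)
    H = hidden.size(-1)
    hidden = torch.gather(hidden, 1, idx.expand(-1, -1, H))          # (B, M, H)

    # Discriminative branch: InfoNCE-style matching of each masked position
    # against all masked patches of the same clip.
    logits = torch.bmm(cls_head(hidden), targets.transpose(1, 2))    # (B, M, M)
    labels = torch.arange(M, device=patches.device).expand(B, M)
    loss_d = F.cross_entropy(logits.reshape(B * M, M), labels.reshape(B * M))

    # Generative branch: reconstruct the masked patch content with MSE.
    loss_g = F.mse_loss(rec_head(hidden), targets)

    return loss_d + loss_weight * loss_g
```

The two branches are complementary: the discriminative term forces the encoder to tell masked patches apart, while the generative term forces it to recover their content; Algorithm 1 in the paper combines both objectives during pretraining.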
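The quoted experiment setup (Adam, initial learning rate 1e-4, learning rate halved when the validation pretext metric stops improving, up to 800k iterations at batch size 24) maps naturally onto a standard PyTorch optimizer plus a ReduceLROnPlateau schedule. The sketch below is an assumption-laden illustration: the placeholder model and the 4k-iteration validation interval are not from the paper; only the learning rate, the halving factor, and the 8k-iteration plateau criterion come from the quote.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the AST encoder (an assumption, not the
# authors' model definition).
model = nn.Linear(16 * 16, 768)

# Adam with the quoted initial learning rate of 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# "Cut the learning rate into half if the pretext task performance on the
# validation set stops improving for 8k iterations", expressed with
# ReduceLROnPlateau. Assuming validation runs every 4k iterations (this
# interval is an assumption), patience=2 checks matches the quoted 8k.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2)

# Training would use batch size 24 and run for up to 800k iterations;
# after each validation pass, call: scheduler.step(val_pretext_metric)
```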