Masked Autoencoders that Listen

Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper studies a simple extension of image-based Masked Autoencoders (MAE) [1] to self-supervised representation learning from audio spectrograms. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. Our code and models are available at https://github.com/facebookresearch/AudioMAE.
Researcher Affiliation | Collaboration | Meta AI and Carnegie Mellon University
Pseudocode | No | The paper describes the Audio-MAE architecture and process, including details on the encoder and decoder components, but does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/AudioMAE.
Open Datasets | Yes | We perform an extensive evaluation on six tasks, including audio classification on Audio Set (AS-2M, AS-20K) and Environmental Sound Classification (ESC-50), and speech classification on Speech Commands (SPC-1 and SPC-2) and VoxCeleb (SID). We use Audio Set for ablation studies. Audio Set [12] (AS-2M, AS-20K)... Environmental Sound Classification (ESC-50) [13]... Speech Commands (SPC-2, SPC-1) [52]... VoxCeleb (SID) [54]...
Dataset Splits | Yes | For the AS-2M experiments, we use the union of unbalanced and balanced training audio for pre-training and fine-tuning. For the AS-20K experiments, we use AS-2M for pre-training and the 20K balanced set for fine-tuning. We report the testing mAP on the 19K eval set used by AST [10]. Environmental Sound Classification (ESC-50) [13] is an audio classification dataset consisting of 2,000 5-second environmental sound recordings across 50 classes. We report accuracy under 5-fold cross-validation with the same split used by [10]. Speech Commands (SPC-2, SPC-1) [52]... The training/validation/testing sets have 84,843/9,981/11,005 1-second recordings, respectively. VoxCeleb (SID) [54]... We use the V1 standard train (138,361), validation (6,904), and testing (8,251) sets and report the testing accuracy.
Hardware Specification | Yes | We distribute the training load over 64 V100 GPUs and the total training time is 36 hours.
Software Dependencies | No | The paper mentions 'Kaldi [55]-compatible Mel-frequency bands' and implicitly uses standard deep learning frameworks, but it does not specify version numbers for any software dependencies such as Python, PyTorch, or Kaldi.
Experiment Setup | Yes | We train for 32 epochs with a batch size of 512 and a 0.0002 learning rate. For each audio, we randomly sample the starting time, cyclically extract 10-second audio, and randomly jitter its magnitude by up to 6 dB. By default, we use a masking ratio of 0.8 with (unstructured) random masking for pre-training. During fine-tuning, we employ a lower masking ratio (0.3 in time and 0.3 in frequency). (Illustrative sketches of the filterbank front end and of the random masking scheme follow the table.)
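
The Software Dependencies row notes that the paper relies on 'Kaldi [55]-compatible Mel-frequency bands' without pinning library versions. Below is a minimal sketch of how such features are commonly computed with torchaudio's Kaldi-compliance API. The 128 Mel bins, 25 ms Hanning window, and 10 ms shift are assumptions based on standard AST-style audio pipelines rather than values quoted in this table, and the helper name is ours, not the authors'.

```python
# Hedged sketch: Kaldi-compatible log-Mel filterbank features via torchaudio.
# Parameter choices (128 bins, 25 ms window, 10 ms shift) are assumptions.
import torch
import torchaudio


def load_fbank(wav_path: str, num_mel_bins: int = 128) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    waveform = waveform - waveform.mean()              # remove DC offset

    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=num_mel_bins,
        window_type="hanning",
        frame_length=25.0,   # ms
        frame_shift=10.0,    # ms
        use_energy=False,
        htk_compat=True,
        dither=0.0,
    )
    return fbank  # (num_frames, num_mel_bins)
```

A 10-second clip at a 10 ms shift yields roughly 1,000 frames, which would then be split into fixed-size patches before masking and encoding.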
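The Experiment Setup row describes unstructured random masking with a 0.8 ratio during pre-training. The sketch below mirrors the per-sample random shuffling used in the public image MAE reference implementation, applied here to flattened spectrogram patch embeddings; it illustrates the idea under that assumption and is not code from the Audio-MAE repository.

```python
# Hedged sketch: MAE-style unstructured random masking of spectrogram patches.
import torch


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.8):
    """Randomly drop a fraction of patch tokens per sample.

    patches: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens, a binary mask (1 = masked) in the original
    patch order, and the indices needed to restore that order in the decoder.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))

    noise = torch.rand(batch, num_patches, device=patches.device)  # per-patch scores
    ids_shuffle = torch.argsort(noise, dim=1)                      # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)                # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patches, dim=1, index=ids_keep.unsqueeze(-1).expand(-1, -1, dim)
    )

    mask = torch.ones(batch, num_patches, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, dim=1, index=ids_restore)            # back to original order

    return visible, mask, ids_restore
```

The fine-tuning setting quoted above (0.3 in time and 0.3 in frequency) is structured rather than unstructured; in this sketch's terms it would group patch indices by their time or frequency coordinate and mask whole groups instead of sampling patches independently.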