Masked Autoencoders that Listen
Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper studies a simple extension of image-based Masked Autoencoders (MAE) [1] to self-supervised representation learning from audio spectrograms. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. Our code and models are available at https://github.com/facebookresearch/AudioMAE. |
| Researcher Affiliation | Collaboration | 1 Meta AI, 2 Carnegie Mellon University |
| Pseudocode | No | The paper describes the Audio-MAE architecture and process, including details on the encoder and decoder components, but does not provide any pseudocode or algorithm blocks. (A hedged sketch of such a forward pass is given after this table.) |
| Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/AudioMAE. |
| Open Datasets | Yes | We perform an extensive evaluation on six tasks, including audio classification on AudioSet (AS-2M, AS-20K) and Environmental Sound Classification (ESC-50), and speech classification on Speech Commands (SPC-1 and SPC-2) and VoxCeleb (SID). We use AudioSet for ablation studies. AudioSet [12] (AS-2M, AS-20K)... Environmental Sound Classification (ESC-50) [13]... Speech Commands (SPC-2, SPC-1) [52]... VoxCeleb (SID) [54]... |
| Dataset Splits | Yes | For the AS-2M experiments, we use the union of unbalanced and balanced training audio for pre-training and fine-tuning. For the AS-20K experiments, we use AS-2M for pre-training and the 20K balanced set for fine-tuning. We report the testing mAP on the 19K eval set used by AST [10]. Environmental Sound Classification (ESC-50) [13] is an audio classification dataset consisting of 2,000 5-second environmental sound recordings. There are 50 classes in ESC-50. We report accuracy under 5-fold cross-validation with the same split used by [10]. Speech Commands (SPC-2, SPC-1) [52]... The training/validation/testing sets have 84,843/9,981/11,005 1-second recordings, respectively. VoxCeleb (SID) [54]... We use the V1 standard train (138,361), validation (6,904), and testing (8,251) sets and report the testing accuracy. |
| Hardware Specification | Yes | We distribute the training load over 64 V100 GPUs and the total training time is 36 hours. |
| Software Dependencies | No | The paper mentions 'Kaldi [55]-compatible Mel-frequency bands' and implicitly uses standard deep learning frameworks, but it does not specify version numbers for any software dependencies, such as Python, PyTorch, or Kaldi. |
| Experiment Setup | Yes | We train for 32 epochs with a batch size of 512 and a 0.0002 learning rate. For each audio clip, we randomly sample the starting time, cyclically extract 10-second audio, and randomly jitter its magnitude by up to 6 dB. By default, we use a masking ratio of 0.8 with (unstructured) random masking for pre-training. During fine-tuning, we employ a lower masking ratio (0.3 in time and 0.3 in frequency). (Hedged sketches of this augmentation and of both masking schemes follow below.) |
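
Since the paper provides no pseudocode (see the Pseudocode row), the following is a minimal, hedged PyTorch sketch of an MAE-style encoder/decoder forward pass over flattened spectrogram patches. All layer sizes, depths, and names here are illustrative assumptions, not the released Audio-MAE implementation; in particular, the paper's decoder additionally uses local window attention, which this sketch omits.

```python
# Minimal sketch of an MAE-style masked autoencoder for spectrogram patches.
# Sizes, depths, and names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class AudioMAESketch(nn.Module):
    def __init__(self, num_patches=512, dim=768, dec_dim=512,
                 patch_pixels=256, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_pixels, dim)  # stand-in for a conv patch embed
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # ViT-B uses 12 blocks
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        self.head = nn.Linear(dec_dim, patch_pixels)  # reconstruct raw patch values

    def random_mask(self, x):
        """Keep a random (1 - mask_ratio) subset of patches, per sample."""
        B, N, D = x.shape
        keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)           # random permutation of patches
        ids_restore = ids_shuffle.argsort(dim=1)     # inverse permutation
        ids_keep = ids_shuffle[:, :keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(B, N, device=x.device)
        mask[:, :keep] = 0
        mask = torch.gather(mask, 1, ids_restore)    # 1 = masked, 0 = visible
        return x_vis, mask, ids_restore

    def forward(self, patches):
        # patches: (B, N, patch_pixels) flattened spectrogram patches
        x = self.patch_embed(patches) + self.pos_embed
        x_vis, mask, ids_restore = self.random_mask(x)
        z = self.encoder(x_vis)                      # encoder sees visible patches only
        z = self.enc_to_dec(z)
        B, N = mask.shape
        mask_tokens = self.mask_token.expand(B, N - z.shape[1], -1)
        full = torch.cat([z, mask_tokens], dim=1)
        full = torch.gather(                         # unshuffle to original patch order
            full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, full.shape[-1]))
        pred = self.head(self.decoder(full))
        loss = ((pred - patches) ** 2).mean(dim=-1)  # MSE per patch
        return (loss * mask).sum() / mask.sum()      # loss on masked patches only


model = AudioMAESketch()
loss = model(torch.randn(2, 512, 256))  # 2 clips, 512 patches of 16x16 = 256 values
loss.backward()
```

Encoding only the visible 20% of patches is what makes the 0.8 masking ratio cheap at pre-training time; the lightweight decoder sees the full-length sequence but is discarded after pre-training.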
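The 10-second cyclic crop and magnitude jitter quoted in the Experiment Setup row can be sketched as below. The 16 kHz sample rate, the symmetric ±6 dB convention, and the function name are assumptions, not the authors' code.

```python
# Hedged sketch of the waveform augmentation: sample a random start, wrap
# around cyclically to a fixed 10 s clip, and jitter gain within ±6 dB.
import torch


def random_cyclic_crop(wav: torch.Tensor, sr: int = 16000, secs: int = 10):
    """wav: 1-D waveform tensor. Returns a fixed-length, gain-jittered clip."""
    target = sr * secs
    if wav.numel() < target:                       # tile short clips first
        wav = wav.repeat(target // wav.numel() + 1)
    start = torch.randint(0, wav.numel(), (1,)).item()
    idx = (start + torch.arange(target)) % wav.numel()  # cyclic indexing
    clip = wav[idx]
    gain_db = (torch.rand(1) * 12 - 6).item()      # uniform in [-6, +6] dB (assumed)
    return clip * (10.0 ** (gain_db / 20.0))       # dB -> linear amplitude
```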
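Likewise, the two masking schemes (unstructured random masking at ratio 0.8 for pre-training; 0.3 time plus 0.3 frequency masking for fine-tuning) can be illustrated on the patch grid. The 8x64 (frequency x time) grid matches 128 mel bins x 1024 frames in 16x16 patches; masking whole rows/columns is one plausible reading of the time/frequency scheme, and the helper names are assumptions.

```python
# Hedged sketch of the two masking strategies on an 8x64 patch grid.
import torch


def unstructured_mask(f_bins=8, t_bins=64, ratio=0.8):
    """Pre-training: mask a random `ratio` of all patches."""
    n = f_bins * t_bins
    ids = torch.randperm(n)[: int(n * ratio)]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[ids] = True
    return mask.view(f_bins, t_bins)  # True = masked


def time_freq_mask(f_bins=8, t_bins=64, t_ratio=0.3, f_ratio=0.3):
    """Fine-tuning: mask whole frequency rows and time columns (SpecAugment-like)."""
    mask = torch.zeros(f_bins, t_bins, dtype=torch.bool)
    mask[torch.randperm(f_bins)[: int(f_bins * f_ratio)], :] = True  # freq rows
    mask[:, torch.randperm(t_bins)[: int(t_bins * t_ratio)]] = True  # time cols
    return mask


print(unstructured_mask().float().mean())  # ~0.80 of patches masked
print(time_freq_mask().float().mean())     # union of rows and cols, below 0.3 + 0.3
```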