Masked Autoencoders that Listen

Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper studies a simple extension of image-based Masked Autoencoders (MAE) [1] to self-supervised representation learning from audio spectrograms. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. Our code and models are available at https://github.com/facebookresearch/AudioMAE.
Researcher Affiliation | Collaboration | Meta AI and Carnegie Mellon University
Pseudocode | No | The paper describes the Audio-MAE architecture and process, including details on the encoder and decoder components, but does not provide any pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/AudioMAE.
Open Datasets | Yes | We perform an extensive evaluation on six tasks, including audio classification on Audio Set (AS-2M, AS-20K) and Environmental Sound Classification (ESC-50), and speech classification on Speech Commands (SPC-1 and SPC-2) and VoxCeleb (SID). We use Audio Set for ablation studies. Audio Set [12] (AS-2M, AS-20K)... Environmental Sound Classification (ESC-50) [13]... Speech Commands (SPC-2, SPC-1) [52]... VoxCeleb (SID) [54]...
Dataset Splits | Yes | For the AS-2M experiments, we use the union of unbalanced and balanced training audio for pre-training and fine-tuning. For the AS-20K experiments, we use AS-2M for pre-training and the 20K balanced set for fine-tuning. We report the testing mAP on the 19K eval set used by AST [10]. Environmental Sound Classification (ESC-50) [13] is an audio classification dataset consisting of 2,000 5-second environmental sound recordings across 50 classes. We report accuracy under 5-fold cross-validation with the same split used by [10]. Speech Commands (SPC-2, SPC-1) [52]... The training/validation/testing sets have 84,843/9,981/11,005 1-second recordings, respectively. VoxCeleb (SID) [54]... We use the V1 standard train (138,361), validation (6,904), and testing (8,251) sets and report the testing accuracy.
Hardware Specification | Yes | We distribute the training load over 64 V100 GPUs and the total training time is 36 hours.
Software Dependencies | No | The paper mentions 'Kaldi [55]-compatible Mel-frequency bands' and implicitly uses standard deep learning frameworks, but it does not specify version numbers for any software dependencies such as Python, PyTorch, or Kaldi.
Experiment Setup | Yes | We train for 32 epochs with a batch size of 512 and a 0.0002 learning rate. For each audio, we randomly sample the starting time, cyclically extract 10-second audio, and randomly jitter its magnitude by up to 6 dB. By default, we use a masking ratio of 0.8 with (unstructured) random masking for pre-training. During fine-tuning, we employ a lower masking ratio (0.3 in time and 0.3 in frequency). (Illustrative sketches of the filterbank front end and of the random masking scheme follow the table.)
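
The Software Dependencies row notes that the paper relies on 'Kaldi [55]-compatible Mel-frequency bands' without pinning library versions. Below is a minimal sketch of how such features are commonly computed with torchaudio's Kaldi-compliance API. The 128 Mel bins, 25 ms Hanning window, and 10 ms shift are assumptions based on standard AST-style audio pipelines rather than values quoted in this table, and the helper name is ours, not the authors'.

```python
# Hedged sketch: Kaldi-compatible log-Mel filterbank features via torchaudio.
# Parameter choices (128 bins, 25 ms window, 10 ms shift) are assumptions.
import torch
import torchaudio


def load_fbank(wav_path: str, num_mel_bins: int = 128) -> torch.Tensor:
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    waveform = waveform - waveform.mean()              # remove DC offset

    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        num_mel_bins=num_mel_bins,
        window_type="hanning",
        frame_length=25.0,   # ms
        frame_shift=10.0,    # ms
        use_energy=False,
        htk_compat=True,
        dither=0.0,
    )
    return fbank  # (num_frames, num_mel_bins)
```

A 10-second clip at a 10 ms shift yields roughly 1,000 frames, which would then be split into fixed-size patches before masking and encoding.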
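The Experiment Setup row describes unstructured random masking with a 0.8 ratio during pre-training. The sketch below mirrors the per-sample random shuffling used in the public image MAE reference implementation, applied here to flattened spectrogram patch embeddings; it illustrates the idea under that assumption and is not code from the Audio-MAE repository.

```python
# Hedged sketch: MAE-style unstructured random masking of spectrogram patches.
import torch


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.8):
    """Randomly drop a fraction of patch tokens per sample.

    patches: (batch, num_patches, embed_dim) patch embeddings.
    Returns the visible tokens, a binary mask (1 = masked) in the original
    patch order, and the indices needed to restore that order in the decoder.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))

    noise = torch.rand(batch, num_patches, device=patches.device)  # per-patch scores
    ids_shuffle = torch.argsort(noise, dim=1)                      # random permutation
    ids_restore = torch.argsort(ids_shuffle, dim=1)                # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(
        patches, dim=1, index=ids_keep.unsqueeze(-1).expand(-1, -1, dim)
    )

    mask = torch.ones(batch, num_patches, device=patches.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, dim=1, index=ids_restore)            # back to original order

    return visible, mask, ids_restore
```

The fine-tuning setting quoted above (0.3 in time and 0.3 in frequency) is structured rather than unstructured; in this sketch's terms it would group patch indices by their time or frequency coordinate and mask whole groups instead of sampling patches independently.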