MAViL: Masked Audio-Video Learners
Authors: Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. From Section 4 (Experiments): We performed comprehensive evaluations, including audio-video classification tasks on AudioSet [28] (AS-2M and AS-20K), and VGGSound [11]. Also, we conducted audio-to-video retrieval experiments on MSR-VTT [96] and YouCook [100]. We use AS-20K for model analysis and ablation studies. |
| Researcher Affiliation | Collaboration | Po-Yao Huang (1), Vasu Sharma (1), Hu Xu (1), Chaitanya Ryali (1), Haoqi Fan (1), Yanghao Li (1), Shang-Wen Li (1), Gargi Ghosh (1), Jitendra Malik (1,2), Christoph Feichtenhofer (1). Affiliations: (1) FAIR, Meta; (2) University of California, Berkeley |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in text and illustrated with diagrams (Fig. 1, Fig. 2). |
| Open Source Code | Yes | The code and models are available at https://github.com/facebookresearch/MAViL. |
| Open Datasets | Yes | We performed comprehensive evaluations, including audio-video classification tasks on AudioSet [28] (AS-2M and AS-20K), and VGGSound [11]. Also, we conducted audio-to-video retrieval experiments on MSR-VTT [96] and YouCook [100]. |
| Dataset Splits | No | The paper specifies training and testing sets, for example: "AudioSet: The eval set has 20K clips. We use the full (unbalanced+balanced) training set for pre-training. In the AS-2M task, we fine-tune on the full training set. In the AS-20K task, we fine-tune only on the 20K balanced training set. We report the classification mAP on the 19K eval set used by AST [30]." and "VGGSound is divided into 183K training and 15K testing samples." However, it does not explicitly describe a separate validation dataset split with specific numbers or percentages for hyperparameter tuning, instead referring to an "eval set" which acts as the test set. |
| Hardware Specification | Yes | We pre-train with 64 V100 GPUs with a 512 accumulated batch size and a 0.0002 learning rate. (A hedged training-loop sketch follows this table.) |
| Software Dependencies | No | The paper mentions tools like "AdamW" and "Kaldi", and refers to general model architectures like "Transformers", but it does not provide specific version numbers for software dependencies or libraries (e.g., "PyTorch 1.9" or "Python 3.8"). (A hedged feature/optimizer sketch follows this table.) |
| Experiment Setup | Yes | We pre-train MAViL on AS-2M without using any of AS-2M labels. We use 80% masking for audio and video. For balancing the losses in Eq. (6), we set α = 0.1, τ_c^inter = 0.1 and β = 0.01, τ_c^intra = 1.0. These hyperparameters scale the gradients from the three losses into a comparable range to improve training stability. We pre-train with 64 V100 GPUs with a 512 accumulated batch size and a 0.0002 learning rate. We pre-train for 20 epochs in stage-1 and in each iteration of stage-2 (for K = 3 iterations). The hyper-parameters are summarized in Table 9. (A hedged loss-balancing sketch follows this table.) |
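
The hardware row above reports 64 V100 GPUs, a 512 accumulated batch size, and a 0.0002 learning rate. Below is a minimal PyTorch sketch of reaching that effective batch size with gradient accumulation; the per-GPU micro-batch size, the accumulation steps, and the `model`/`loader` interfaces are assumptions for illustration, not details confirmed by the paper.

```python
import torch

# Hypothetical decomposition of the reported effective batch size of 512
# across 64 GPUs; PER_GPU_BATCH and ACCUM_STEPS are assumptions chosen so
# that NUM_GPUS * PER_GPU_BATCH * ACCUM_STEPS == 512.
NUM_GPUS = 64
PER_GPU_BATCH = 4        # assumption
ACCUM_STEPS = 2          # assumption: 64 * 4 * 2 = 512
BASE_LR = 2e-4           # reported pre-training learning rate

def train_one_epoch(model: torch.nn.Module,
                    optimizer: torch.optim.Optimizer,
                    loader) -> None:
    """Gradient accumulation: step the optimizer every ACCUM_STEPS micro-batches."""
    model.train()
    optimizer.zero_grad()
    for step, (audio, video) in enumerate(loader):
        loss = model(audio, video)        # hypothetical forward returning a scalar loss
        (loss / ACCUM_STEPS).backward()   # average gradients over the accumulation window
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```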
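
The software-dependencies row notes AdamW and Kaldi without versions. As a hedged sketch only: audio-transformer pipelines in this lineage commonly compute Kaldi-compatible log-mel filterbanks through torchaudio and optimize with AdamW; the parameter values below (128 mel bins, 16 kHz, 10 ms frame shift, the weight decay and betas) are illustrative assumptions, not settings confirmed by the paper.

```python
import torch
import torchaudio
from torch.optim import AdamW

def kaldi_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Kaldi-compatible log-mel filterbank features (assumed parameter values)."""
    return torchaudio.compliance.kaldi.fbank(
        waveform,                      # shape: (channels, num_samples)
        sample_frequency=sample_rate,
        num_mel_bins=128,              # assumption
        frame_shift=10.0,              # assumption, in milliseconds
        htk_compat=True,               # assumption
    )

def build_optimizer(model: torch.nn.Module) -> AdamW:
    """AdamW with the reported 0.0002 learning rate; other values are assumed."""
    return AdamW(model.parameters(), lr=2e-4, weight_decay=0.05, betas=(0.9, 0.95))
```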
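
The experiment-setup row reports the weights and temperatures used to balance the losses in Eq. (6). The sketch below only illustrates how such a weighted combination of a reconstruction loss with inter-modal and intra-modal contrastive terms is typically implemented (symmetric InfoNCE); the exact loss definitions are in the paper, and the embedding arguments here are hypothetical.

```python
import torch
import torch.nn.functional as F

# Reported values: alpha = 0.1, tau_c^inter = 0.1, beta = 0.01, tau_c^intra = 1.0.
ALPHA, TAU_INTER = 0.1, 0.1
BETA, TAU_INTRA = 0.01, 1.0

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def total_loss(recon_loss: torch.Tensor,
               audio_emb: torch.Tensor, video_emb: torch.Tensor,
               audio_emb_alt: torch.Tensor, video_emb_alt: torch.Tensor) -> torch.Tensor:
    """Weighted sum mirroring the reported balancing of the three losses."""
    inter = info_nce(audio_emb, video_emb, TAU_INTER)              # audio <-> video
    intra = 0.5 * (info_nce(audio_emb, audio_emb_alt, TAU_INTRA)   # audio <-> alternate audio view
                   + info_nce(video_emb, video_emb_alt, TAU_INTRA))
    return recon_loss + ALPHA * inter + BETA * intra
```

With batch embeddings of shape (B, D), `total_loss` returns a scalar suitable for the accumulation loop sketched above; the small ALPHA and BETA keep the contrastive gradients in a range comparable to the reconstruction term, as the paper describes.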