MAViL: Masked Audio-Video Learners
Authors: Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, MAViL achieves state-of-the-art audio-video classification performance on AudioSet (53.3 mAP) and VGGSound (67.1% accuracy), surpassing recent self-supervised models and supervised models that utilize external labeled data. From Section 4 (Experiments): We performed comprehensive evaluations, including audio-video classification tasks on AudioSet [28] (AS-2M and AS-20K), and VGGSound [11]. Also, we conducted audio-to-video retrieval experiments on MSR-VTT [96] and YouCook [100]. We use AS-20K for model analysis and ablation studies. |
| Researcher Affiliation | Collaboration | Po-Yao Huang (1), Vasu Sharma (1), Hu Xu (1), Chaitanya Ryali (1), Haoqi Fan (1), Yanghao Li (1), Shang-Wen Li (1), Gargi Ghosh (1), Jitendra Malik (1,2), Christoph Feichtenhofer (1). Affiliations: (1) FAIR, Meta; (2) University of California, Berkeley |
| Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks. The methods are described in text and illustrated with diagrams (Fig. 1, Fig. 2). |
| Open Source Code | Yes | The code and models are available at https://github.com/facebookresearch/MAViL. |
| Open Datasets | Yes | We performed comprehensive evaluations, including audio-video classification tasks on AudioSet [28] (AS-2M and AS-20K), and VGGSound [11]. Also, we conducted audio-to-video retrieval experiments on MSR-VTT [96] and YouCook [100]. |
| Dataset Splits | No | The paper specifies training and testing sets, for example: "AudioSet: The eval set has 20K clips. We use the full (unbalanced+balanced) training set for pre-training. In the AS-2M task, we fine-tune on the full training set. In the AS-20K task, we fine-tune only on the 20K balanced training set. We report the classification mAP on the 19K eval set used by AST [30]." and "VGGSound is divided into 183K training and 15K testing samples." However, it does not explicitly describe a separate validation dataset split with specific numbers or percentages for hyperparameter tuning, instead referring to an "eval set" which acts as the test set. |
| Hardware Specification | Yes | We pre-train with 64 V100 GPUs with a 512 accumulated batch size and a 0.0002 learning rate. (A hedged training-loop sketch follows this table.) |
| Software Dependencies | No | The paper mentions tools like "AdamW" and "Kaldi", and refers to general model architectures like "Transformers", but it does not provide specific version numbers for software dependencies or libraries (e.g., "PyTorch 1.9" or "Python 3.8"). (A hedged feature/optimizer sketch follows this table.) |
| Experiment Setup | Yes | We pre-train MAViL on AS-2M without using any of AS-2M labels. We use 80% masking for audio and video. For balancing the losses in Eq. (6), we set α = 0.1, τ_c^inter = 0.1 and β = 0.01, τ_c^intra = 1.0. These hyperparameters scale the gradients from the three losses into a comparable range to improve training stability. We pre-train with 64 V100 GPUs with a 512 accumulated batch size and a 0.0002 learning rate. We pre-train for 20 epochs in stage-1 and in each iteration of stage-2 (for K = 3 iterations). The hyper-parameters are summarized in Table 9. (A hedged loss-balancing sketch follows this table.) |
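
The hardware row above reports 64 V100 GPUs, a 512 accumulated batch size, and a 0.0002 learning rate. Below is a minimal PyTorch sketch of reaching that effective batch size with gradient accumulation; the per-GPU micro-batch size, the accumulation steps, and the `model`/`loader` interfaces are assumptions for illustration, not details confirmed by the paper.

```python
import torch

# Hypothetical decomposition of the reported effective batch size of 512
# across 64 GPUs; PER_GPU_BATCH and ACCUM_STEPS are assumptions chosen so
# that NUM_GPUS * PER_GPU_BATCH * ACCUM_STEPS == 512.
NUM_GPUS = 64
PER_GPU_BATCH = 4        # assumption
ACCUM_STEPS = 2          # assumption: 64 * 4 * 2 = 512
BASE_LR = 2e-4           # reported pre-training learning rate

def train_one_epoch(model: torch.nn.Module,
                    optimizer: torch.optim.Optimizer,
                    loader) -> None:
    """Gradient accumulation: step the optimizer every ACCUM_STEPS micro-batches."""
    model.train()
    optimizer.zero_grad()
    for step, (audio, video) in enumerate(loader):
        loss = model(audio, video)        # hypothetical forward returning a scalar loss
        (loss / ACCUM_STEPS).backward()   # average gradients over the accumulation window
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad()
```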
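
The software-dependencies row notes AdamW and Kaldi without versions. As a hedged sketch only: audio-transformer pipelines in this lineage commonly compute Kaldi-compatible log-mel filterbanks through torchaudio and optimize with AdamW; the parameter values below (128 mel bins, 16 kHz, 10 ms frame shift, the weight decay and betas) are illustrative assumptions, not settings confirmed by the paper.

```python
import torch
import torchaudio
from torch.optim import AdamW

def kaldi_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Kaldi-compatible log-mel filterbank features (assumed parameter values)."""
    return torchaudio.compliance.kaldi.fbank(
        waveform,                      # shape: (channels, num_samples)
        sample_frequency=sample_rate,
        num_mel_bins=128,              # assumption
        frame_shift=10.0,              # assumption, in milliseconds
        htk_compat=True,               # assumption
    )

def build_optimizer(model: torch.nn.Module) -> AdamW:
    """AdamW with the reported 0.0002 learning rate; other values are assumed."""
    return AdamW(model.parameters(), lr=2e-4, weight_decay=0.05, betas=(0.9, 0.95))
```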
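
The experiment-setup row reports the weights and temperatures used to balance the losses in Eq. (6). The sketch below only illustrates how such a weighted combination of a reconstruction loss with inter-modal and intra-modal contrastive terms is typically implemented (symmetric InfoNCE); the exact loss definitions are in the paper, and the embedding arguments here are hypothetical.

```python
import torch
import torch.nn.functional as F

# Reported values: alpha = 0.1, tau_c^inter = 0.1, beta = 0.01, tau_c^intra = 1.0.
ALPHA, TAU_INTER = 0.1, 0.1
BETA, TAU_INTRA = 0.01, 1.0

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired, L2-normalized embeddings."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def total_loss(recon_loss: torch.Tensor,
               audio_emb: torch.Tensor, video_emb: torch.Tensor,
               audio_emb_alt: torch.Tensor, video_emb_alt: torch.Tensor) -> torch.Tensor:
    """Weighted sum mirroring the reported balancing of the three losses."""
    inter = info_nce(audio_emb, video_emb, TAU_INTER)              # audio <-> video
    intra = 0.5 * (info_nce(audio_emb, audio_emb_alt, TAU_INTRA)   # audio <-> alternate audio view
                   + info_nce(video_emb, video_emb_alt, TAU_INTRA))
    return recon_loss + ALPHA * inter + BETA * intra
```

With batch embeddings of shape (B, D), `total_loss` returns a scalar suitable for the accumulation loop sketched above; the small ALPHA and BETA keep the contrastive gradients in a range comparable to the reconstruction term, as the paper describes.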