MERLOT: Multimodal Neural Script Knowledge Models
Authors: Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our model on a diverse suite of video tasks, requiring both recognition- and cognition-level reasoning across long and short timescales; when finetuned, MERLOT achieves a new state-of-the-art on 12 such tasks. Additionally, we show that our script-knowledge representations transfer to the single image domain. On Visual Commonsense Reasoning (VCR; [123]), our model achieves particularly strong performance, outperforming models that require heavy visual supervision (in the form of object detection bounding boxes, or images paired with pristine captions). Beyond finetuning, we show both quantitatively and qualitatively that MERLOT has a strong out-of-the-box understanding of everyday events and situations. Given a scrambled visual story [50, 2], MERLOT can sort image sequences to match captions which tell a globally coherent narrative. Despite considerable domain shift from videos to static images, MERLOT outperforms strong baselines like CLIP [89] and UNITER [22], which independently match images to text and thus cannot reason over long-term contexts as effectively. This capacity for temporal coherence emerges during pretraining: analysis of MERLOT's attention patterns (Figure 11) shows that regions attend to captions that are distant in time (and vice versa), allowing it to perform cross-modal coreference to piece together a holistic view of situations. Finally, ablations of MERLOT show that 1) pretraining works better when we train on videos rather than still images, aided crucially by our strategy of corrupting highly visual words in the masked language modeling task, 2) using a diverse set of videos covering many aspects of everyday situations improves downstream performance compared to curated instructional video corpora [107, 80], which both cover a smaller slice of the visual world (confirming hypotheses from past work [47]); and 3) MERLOT's performance does not saturate even after many epochs of training on the pretraining corpus we curated, YT-Temporal-180M, as it continues to improve performance simply with more pretraining. The combination of these results suggests that learning full-stack visual reasoning and multimodal world knowledge from video data is a promising path forward for future research. |
| Researcher Affiliation | Collaboration | Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi; Paul G. Allen School of Computer Science & Engineering, University of Washington; Allen Institute for Artificial Intelligence; https://rowanzellers.com/merlot |
| Pseudocode | No | The paper describes the architecture and pretraining tasks in text and diagrams but does not provide pseudocode or algorithm blocks. |
| Open Source Code | Yes | At rowanzellers.com/merlot, we have released code, data, and models for public research use. |
| Open Datasets | Yes | We collect YT-Temporal-180M, a dataset for learning multimodal script knowledge, derived from 6 million public YouTube videos. Our YT-Temporal-180M intentionally spans many domains, datasets, and topics. We began with 27 million candidate video IDs (which we then filtered), including instructional videos from HowTo100M [80], lifestyle vlogs of everyday events from the VLOG dataset [35], and YouTube's auto-suggested videos for popular topics like science or home improvement. |
| Dataset Splits | Yes | We pretrain our model for 40 epochs over our video dataset. We preprocess the dataset into examples with sequences of N=16 video segments each, each containing up to L=32 BPE tokens. The language-only encoder computes contrastive representations given this entire sequence; its total length is thus 512 tokens. To save memory, we provide the joint vision-language encoder 4 groups of N=4 segments each. At an image training resolution of 192×352, the joint model's sequence length is 396 tokens. To combine the losses, we multiply the contrastive loss by a coefficient of 0.25, which we found scaled its gradient magnitudes to roughly the same magnitude as the MaskLM loss. We train the model using a v3-1024 TPU pod, at a batch size of 1024 sequences (or 16k segments) in total. This pretraining process on this hardware takes 30 hours. We provide additional information about hyperparameters and experimental setup in Appendix E.1. |
| Hardware Specification | Yes | We train the model using a v3-1024 TPU pod, at a batch size of 1024 sequences (or 16k segments) in total. This pretraining process on this hardware takes 30 hours. We provide additional information about hyperparameters and experimental setup in Appendix E.1. |
| Software Dependencies | No | The paper mentions using a Transformer [112] and initializing it with pretrained RoBERTa weights [72], but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We pretrain our model for 40 epochs over our video dataset. We preprocess the dataset into examples with sequences of N=16 video segments each, each containing up to L=32 BPE tokens. The language-only encoder computes contrastive representations given this entire sequence; its total length is thus 512 tokens. To save memory, we provide the joint vision-language encoder 4 groups of N=4 segments each. At an image training resolution of 192×352, the joint model's sequence length is 396 tokens. To combine the losses, we multiply the contrastive loss by a coefficient of 0.25, which we found scaled its gradient magnitudes to roughly the same magnitude as the MaskLM loss. (The sequence-length arithmetic and the loss combination quoted here are sketched below the table.) |
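The segment and sequence-length figures quoted in the Dataset Splits and Experiment Setup rows are internally consistent. A minimal sketch of that arithmetic is below; the constant names are illustrative assumptions, not identifiers from the released MERLOT code.

```python
# Illustrative config check; names are assumptions, not MERLOT's code.
N_SEGMENTS = 16          # video segments per pretraining example
TOKENS_PER_SEGMENT = 32  # maximum BPE tokens per segment

# The language-only (contrastive) encoder reads the full sequence of segments,
# so its text sequence length is 16 * 32 = 512 tokens, as quoted above.
language_only_len = N_SEGMENTS * TOKENS_PER_SEGMENT
assert language_only_len == 512

# The joint vision-language encoder instead receives 4 groups of 4 segments each
# (at 192x352 image resolution), which the paper reports as a 396-token sequence.
GROUPS, SEGMENTS_PER_GROUP = 4, 4
assert GROUPS * SEGMENTS_PER_GROUP == N_SEGMENTS
```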
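The loss weighting described in the Experiment Setup row amounts to a weighted sum of the two quoted objectives. Below is a minimal PyTorch-style sketch, assuming both losses are already computed as scalar tensors; the function name is hypothetical, and any other MERLOT pretraining terms are omitted.

```python
import torch


def combined_pretraining_loss(contrastive_loss: torch.Tensor,
                              masklm_loss: torch.Tensor,
                              contrastive_coef: float = 0.25) -> torch.Tensor:
    """Weighted sum of the contrastive and MaskLM objectives.

    Per the quoted setup, the 0.25 coefficient scales the contrastive loss so
    its gradient magnitudes are roughly comparable to those of the MaskLM loss.
    """
    return contrastive_coef * contrastive_loss + masklm_loss
```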