VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Authors: Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models on several downstream language understanding tasks, including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models.
Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal; UNC Chapel Hill; {terran, jmincho, haotan, mbansal}@cs.unc.edu
Pseudocode | No | The paper describes its methods through architectural diagrams (Figures 1, 2, and 3) and textual explanations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models: https://github.com/zinengtang/VidLanKD
Open Datasets | Yes | We use HowTo100M [54] for cross-modal pretraining of our teacher model (Sec. 3.2). ... we follow Tan and Bansal [68] to use English Wikipedia. For ablation studies (Sec. 5.2), we use Wiki103 [52], a widely used subset of English Wikipedia. (An illustrative corpus-loading sketch follows the table.)
Dataset Splits | Yes | We reserve 10K samples of the HowTo100M dataset as validation data. We train the teacher model until it converges on validation data. For downstream tasks, we report the results on the validation sets. (A validation-split sketch follows the table.)
Hardware Specification | Yes | We implement our models with PyTorch 1.5 [56] and train them with Nvidia GeForce RTX 2080 Ti GPUs. For teacher pretraining, we use 4 GPUs for the BERT12L/768H and BERT6L/512H models for 7 days and 2.5 days, respectively. For knowledge distillation, we use 4 GPUs for the BERT12L/768H and BERT6L/512H models for 10 days and 3 days, respectively.
Software Dependencies | Yes | We implement our models with PyTorch 1.5 [56].
Experiment Setup | Yes | We use an AdamW [41] optimizer with learning rate 2e-4 and weight decay [50] of 0.01. We train for 3 epochs with a learning rate of 1e-4 and a batch size of 32 for all downstream tasks. We use a hinge loss margin α = 1.0 for L_CT (Eq. 3). For both student and teacher language models, following previous works [49; 17; 68], we truncate input text longer than 128 tokens. We truncate video features longer than 512 frames. (A training-setup sketch follows the table.)
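
A minimal sketch of obtaining the Wiki103 ablation corpus named in the Open Datasets row. The Hugging Face datasets library and the "wikitext-103-raw-v1" configuration are assumptions for illustration; the paper does not state which tooling was used.

```python
# Illustrative only: one way to fetch the WikiText-103 (Wiki103) corpus.
# The `datasets` library is an assumption, not part of the authors' released code.
from datasets import load_dataset

wiki103 = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wiki103)                       # DatasetDict with train / validation / test splits
print(wiki103["train"][0]["text"])   # first raw text line of the training split
```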
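
A minimal sketch of the Dataset Splits protocol, i.e. holding out 10K samples of the pretraining corpus as validation data. Here full_dataset is a hypothetical stand-in for the HowTo100M video-text dataset object, and the fixed seed is an illustrative choice not reported in the paper.

```python
# Hold out 10K samples as validation data, as described in the Dataset Splits row.
import torch
from torch.utils.data import TensorDataset, random_split

full_dataset = TensorDataset(torch.arange(1_000_000))  # hypothetical stand-in dataset
n_val = 10_000
n_train = len(full_dataset) - n_val

train_set, val_set = random_split(
    full_dataset,
    [n_train, n_val],
    generator=torch.Generator().manual_seed(0),  # fixed seed; illustrative choice
)
print(len(train_set), len(val_set))  # 990000 10000
```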
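
A sketch of the quoted Experiment Setup hyper-parameters: AdamW with learning rate 2e-4 and weight decay 0.01, a hinge loss with margin α = 1.0, and truncation to 128 text tokens / 512 video frames. The names model, pos_score, and neg_score are placeholders; the actual L_CT loss (Eq. 3) is defined over video-text similarity scores in the paper, so this is a hedged approximation rather than the released implementation.

```python
# Hedged sketch of the reported pretraining hyper-parameters (not the authors' code).
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder for the BERT12L/768H-style model

# AdamW with the reported learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

def hinge_loss(pos_score: torch.Tensor, neg_score: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """max(0, margin - positive_score + negative_score), averaged over the batch."""
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()

MAX_TEXT_TOKENS = 128    # input text truncated to 128 tokens
MAX_VIDEO_FRAMES = 512   # video features truncated to 512 frames
```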