VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Authors: Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models on several downstream language understanding tasks, including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models.
Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal; UNC Chapel Hill; {terran, jmincho, haotan, mbansal}@cs.unc.edu
Pseudocode | No | The paper describes its methods through architectural diagrams (Figures 1, 2, and 3) and textual explanations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and models: https://github.com/zinengtang/VidLanKD
Open Datasets | Yes | We use HowTo100M [54] for cross-modal pretraining of our teacher model (Sec. 3.2). ... we follow Tan and Bansal [68] to use English Wikipedia. For ablation studies (Sec. 5.2), we use Wiki103 [52], a widely used subset of English Wikipedia. (An illustrative corpus-loading sketch follows the table.)
Dataset Splits | Yes | We reserve 10K samples of the HowTo100M dataset as validation data. We train the teacher model until it converges on validation data. For downstream tasks, we report the results on the validation sets. (A validation-split sketch follows the table.)
Hardware Specification | Yes | We implement our models with PyTorch 1.5 [56] and train them with Nvidia GeForce RTX 2080 Ti GPUs. For teacher pretraining, we use 4 GPUs for the BERT12L/768H and BERT6L/512H models for 7 days and 2.5 days, respectively. For knowledge distillation, we use 4 GPUs for the BERT12L/768H and BERT6L/512H models for 10 days and 3 days, respectively.
Software Dependencies | Yes | We implement our models with PyTorch 1.5 [56].
Experiment Setup | Yes | We use an AdamW [41] optimizer with learning rate 2e-4 and weight decay [50] of 0.01. We train for 3 epochs with a learning rate of 1e-4 and a batch size of 32 for all downstream tasks. We use a hinge loss margin α = 1.0 for L_CT (Eq. 3). For both student and teacher language models, following previous works [49; 17; 68], we truncate input text longer than 128 tokens. We truncate video features longer than 512 frames. (A training-setup sketch follows the table.)
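
A minimal sketch of obtaining the Wiki103 ablation corpus named in the Open Datasets row. The Hugging Face datasets library and the "wikitext-103-raw-v1" configuration are assumptions for illustration; the paper does not state which tooling was used.

```python
# Illustrative only: one way to fetch the WikiText-103 (Wiki103) corpus.
# The `datasets` library is an assumption, not part of the authors' released code.
from datasets import load_dataset

wiki103 = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wiki103)                       # DatasetDict with train / validation / test splits
print(wiki103["train"][0]["text"])   # first raw text line of the training split
```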
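
A minimal sketch of the Dataset Splits protocol, i.e. holding out 10K samples of the pretraining corpus as validation data. Here full_dataset is a hypothetical stand-in for the HowTo100M video-text dataset object, and the fixed seed is an illustrative choice not reported in the paper.

```python
# Hold out 10K samples as validation data, as described in the Dataset Splits row.
import torch
from torch.utils.data import TensorDataset, random_split

full_dataset = TensorDataset(torch.arange(1_000_000))  # hypothetical stand-in dataset
n_val = 10_000
n_train = len(full_dataset) - n_val

train_set, val_set = random_split(
    full_dataset,
    [n_train, n_val],
    generator=torch.Generator().manual_seed(0),  # fixed seed; illustrative choice
)
print(len(train_set), len(val_set))  # 990000 10000
```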
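
A sketch of the quoted Experiment Setup hyper-parameters: AdamW with learning rate 2e-4 and weight decay 0.01, a hinge loss with margin α = 1.0, and truncation to 128 text tokens / 512 video frames. The names model, pos_score, and neg_score are placeholders; the actual L_CT loss (Eq. 3) is defined over video-text similarity scores in the paper, so this is a hedged approximation rather than the released implementation.

```python
# Hedged sketch of the reported pretraining hyper-parameters (not the authors' code).
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # placeholder for the BERT12L/768H-style model

# AdamW with the reported learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=0.01)

def hinge_loss(pos_score: torch.Tensor, neg_score: torch.Tensor, margin: float = 1.0) -> torch.Tensor:
    """max(0, margin - positive_score + negative_score), averaged over the batch."""
    return torch.clamp(margin - pos_score + neg_score, min=0).mean()

MAX_TEXT_TOKENS = 128    # input text truncated to 128 tokens
MAX_VIDEO_FRAMES = 512   # video features truncated to 512 frames
```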