VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Authors: Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, VIDLANKD achieves consistent improvements over text-only language models and vokenization models, on several downstream language understanding tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. |
| Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal, UNC Chapel Hill, {terran, jmincho, haotan, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper describes methods through architectural diagrams (Figures 1, 2, and 3) and textual explanations, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and models: https://github.com/zinengtang/VidLanKD |
| Open Datasets | Yes | We use HowTo100M [54] for cross-modal pretraining of our teacher model (Sec. 3.2). ... we follow Tan and Bansal [68] to use English Wikipedia. For ablation studies (Sec. 5.2), we use Wiki103 [52], a widely used subset of English Wikipedia. |
| Dataset Splits | Yes | We reserve 10K samples of the HowTo100M dataset as validation data. We train the teacher model until it converges on validation data. For downstream tasks, we report the results on the validation sets. |
| Hardware Specification | Yes | We implement our models with PyTorch 1.5 [56] and train them with Nvidia GeForce RTX 2080 Ti GPUs. For teacher pretraining, we use 4 GPUs for BERT12L/768H and BERT6L/512H models for 7 days and 2.5 days respectively. For knowledge distillation, we use 4 GPUs for BERT12L/768H and BERT6L/512H models for 10 days and 3 days respectively. |
| Software Dependencies | Yes | We implement our models with PyTorch 1.5 [56] |
| Experiment Setup | Yes | We use an AdamW [41] optimizer with learning rate 2e-4 and weight decay [50] of 0.01. We train 3 epochs with a learning rate of 1e-4 and a batch size of 32 for all downstream tasks. We use hinge loss margin α = 1.0 for L_CT (Eq. 3). For both student and teacher language models, following previous works [49; 17; 68], we truncate input text that is longer than 128 tokens. We truncate video features that are longer than 512 frames. |
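
The experiment-setup quote above is concrete enough to sketch in code. The following is a minimal PyTorch sketch of the reported hyperparameters (AdamW with weight decay 0.01, lr 2e-4 for teacher pretraining and 1e-4 for downstream fine-tuning, hinge margin α = 1.0, 128-token text / 512-frame video truncation). The function names and the exact form of the hinge term are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
# Sketch of the reported training configuration; names and the hinge
# formulation are hypothetical and for illustration only.
import torch
from torch.optim import AdamW

MAX_TEXT_TOKENS = 128   # truncate input text longer than 128 tokens
MAX_VIDEO_FRAMES = 512  # truncate video features longer than 512 frames

def build_optimizer(model: torch.nn.Module, pretraining: bool) -> AdamW:
    # Paper reports lr 2e-4 for teacher pretraining / distillation and
    # lr 1e-4 for downstream fine-tuning, both with weight decay 0.01.
    lr = 2e-4 if pretraining else 1e-4
    return AdamW(model.parameters(), lr=lr, weight_decay=0.01)

def contrastive_hinge_loss(pos_scores: torch.Tensor,
                           neg_scores: torch.Tensor,
                           margin: float = 1.0) -> torch.Tensor:
    # Generic margin-based hinge loss with alpha = 1.0, standing in for the
    # L_CT term (Eq. 3); pos_scores / neg_scores are similarity scores for
    # matched vs. mismatched text-video pairs, shape [batch].
    return torch.clamp(margin - pos_scores + neg_scores, min=0.0).mean()
```

For downstream tasks this optimizer would be run for 3 epochs with a batch size of 32, per the quoted setup.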