TVLT: Textless Vision-Language Transformer
Authors: Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate models on video-based and image-based vision-and-language tasks to compare the learned representation based on audio and text. Table 1 shows that TVLT outperforms the text-based counterpart in audio-to-video retrieval tasks when pretrained on either HowTo100M or YTT-S. We comprehensively analyze the efficiency of our model and show ablation studies over different training variants. |
| Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal (UNC Chapel Hill); {terran, jmincho, yixin1, mbansal}@cs.unc.edu |
| Pseudocode | No | The paper describes the model architecture and pretraining objectives using text and diagrams (Figure 2), but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/zinengtang/TVLT |
| Open Datasets | Yes | HowTo100M [52], YTTemporal180M [86], MSR-VTT [81], YouCook2 [90], CrossTask [92], CMU-MOSEI [84], Places-400k (The Places Audio Caption 400K Corpus) [25; 23; 24], VQAv1 [4], and VQAv2 [21]. |
| Dataset Splits | Yes | YouCook2... has 9,586 training clips and 3,350 validation clips. We report the validation split results. CrossTask... has 17,840 training clips and 2,819 validation clips. We report the validation split results. |
| Hardware Specification | Yes | Pretraining takes 2 weeks with 4 NVIDIA RTX A6000 GPUs (each with 48GB memory). We use 2 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software like the 'SpeechBrain package', the 'WaveNet Google Cloud Text-to-Speech API', and 'librosa', but does not provide specific version numbers for these components in the experimental setup description. |
| Experiment Setup | Yes | We train TVLT and the text-based TVLT counterpart for 200k steps using Adam optimizer [33] with a learning rate of 1e-5, batch size 4096, and a decay rate of 0.001 with a cosine schedule [47]. For the pretraining objectives in Eq. (1), we use λ_VAM = 1.0 and λ_MAE = 0.3. Finetuning on Downstream Tasks. We use a learning rate of 1e-5, batch size 256, and a decay rate of 0.001 with a cosine schedule for all tasks. |
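
The "Experiment Setup" row above quotes the pretraining recipe; the sketch below shows one way that recipe could be wired up in PyTorch. It is not the authors' released code: the `nn.Linear` stand-in for the encoder, the placeholder VAM/MAE loss terms, and the reading of "decay rate of 0.001" as Adam weight decay are assumptions, while the learning rate (1e-5), cosine schedule, 200k steps, and the Eq. (1) weights λ_VAM = 1.0 and λ_MAE = 0.3 come from the quoted setup.

```python
# Hedged sketch of the quoted pretraining recipe, NOT the released TVLT code.
# Assumptions: a stand-in nn.Linear for the real encoder, placeholder loss
# terms, and "decay rate of 0.001" interpreted as Adam weight decay.
import torch
import torch.nn as nn

LAMBDA_VAM, LAMBDA_MAE = 1.0, 0.3   # Eq. (1) loss weights quoted from the paper
TOTAL_STEPS = 200_000               # pretraining steps quoted from the paper

model = nn.Linear(768, 768)         # placeholder for the actual TVLT encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

def pretraining_loss(vam_loss, mae_loss):
    """Weighted sum of vision-audio matching (VAM) and masked-autoencoding (MAE) losses."""
    return LAMBDA_VAM * vam_loss + LAMBDA_MAE * mae_loss

# One illustrative optimization step with dummy inputs standing in for video/audio batches.
x = torch.randn(8, 768)
out = model(x)
vam_loss = out.pow(2).mean()          # placeholder for the real VAM term
mae_loss = (out - x).pow(2).mean()    # placeholder for the real MAE reconstruction term
pretraining_loss(vam_loss, mae_loss).backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

For the actual model definitions, pretraining objectives, and finetuning scripts, see the authors' repository linked in the "Open Source Code" row.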