TVLT: Textless Vision-Language Transformer

Authors: Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate models on video-based and image-based vision-and-language tasks to compare the learned representation based on audio and text. Table 1 shows that TVLT outperforms the text-based counterpart in audio-to-video retrieval tasks when pretrained on either HowTo100M or YTT-S. We comprehensively analyze the efficiency of our model and show ablation studies over different training variants.
Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal (UNC Chapel Hill); {terran, jmincho, yixin1, mbansal}@cs.unc.edu
Pseudocode | No | The paper describes the model architecture and pretraining objectives through text and diagrams (Figure 2), but it does not include any explicit pseudocode or algorithm blocks. An illustrative sketch of the described objectives is given after this table.
Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/zinengtang/TVLT
Open Datasets | Yes | HowTo100M [52], YTTemporal180M [86], MSR-VTT [81], YouCook2 [90], CrossTask [92], CMU-MOSEI [84], Places-400k (The Places Audio Caption 400K Corpus) [25; 23; 24], VQAv1 [4], and VQAv2 [21].
Dataset Splits | Yes | YouCook2... has 9,586 training clips and 3,350 validation clips. We report the validation split results. CrossTask... has 17,840 training clips and 2,819 validation clips. We report the validation split results.
Hardware Specification | Yes | Pretraining takes 2 weeks with 4 NVIDIA RTX A6000 GPUs (each 48GB memory). We use 2 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions software such as the 'SpeechBrain package', the 'WaveNet Google Cloud Text-to-Speech API', and 'librosa', but does not provide specific version numbers for these components in its experimental setup description.
Experiment Setup | Yes | We train TVLT and the text-based TVLT counterpart for 200k steps using Adam optimizer [33] with a learning rate of 1e-5, batch size 4096, and a decay rate of 0.001 with a cosine schedule [47]. For the pretraining objectives in Eq. (1), we use λ_VAM = 1.0 and λ_MAE = 0.3. Finetuning on Downstream Tasks: We use a learning rate of 1e-5, batch size 256, and a decay rate of 0.001 with a cosine schedule for all tasks.
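
As noted in the Pseudocode row, the paper conveys its architecture and pretraining objectives only through prose and Figure 2. The sketch below is not the authors' implementation: the module names, patch sizes, masking strategy, and dimensions are all assumptions, intended only to make the described flow concrete. The flow is: patch-embed video frames and the audio spectrogram, encode both modalities with a single shared transformer (no text input), reconstruct masked patches for the masked-autoencoding objective, and score matched versus mismatched video-audio pairs for the vision-audio matching objective.

```python
import torch
import torch.nn as nn

class TVLTSketch(nn.Module):
    """Illustrative sketch of the described pipeline; not the released implementation."""

    def __init__(self, dim=768, depth=12, heads=12, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Patch embeddings for video frames and the audio spectrogram (patch sizes are assumptions).
        self.video_patch = nn.Linear(16 * 16 * 3, dim)
        self.audio_patch = nn.Linear(16 * 16, dim)
        # Shared, textless transformer encoder over the concatenated patch tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Shallow decoder and per-modality heads for masked-autoencoding reconstruction.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), 2)
        self.video_recon = nn.Linear(dim, 16 * 16 * 3)
        self.audio_recon = nn.Linear(dim, 16 * 16)
        # Binary head for vision-audio matching (VAM).
        self.vam_head = nn.Linear(dim, 1)

    def forward(self, video_patches, audio_patches):
        v = self.video_patch(video_patches)          # (B, Nv, dim)
        a = self.audio_patch(audio_patches)          # (B, Na, dim)
        tokens = torch.cat([v, a], dim=1)
        # Zero out a random subset of patches for the MAE objective; real MAE-style
        # encoders usually drop masked tokens instead of zeroing them.
        keep = (torch.rand(tokens.shape[:2], device=tokens.device) > self.mask_ratio).float()
        enc = self.encoder(tokens * keep.unsqueeze(-1))
        dec = self.decoder(enc)
        video_rec = self.video_recon(dec[:, : v.shape[1]])
        audio_rec = self.audio_recon(dec[:, v.shape[1]:])
        match_logit = self.vam_head(enc.mean(dim=1)).squeeze(-1)
        return video_rec, audio_rec, match_logit

# Tiny usage example with random inputs and a small configuration.
model = TVLTSketch(dim=64, depth=2, heads=4)
video = torch.randn(2, 196, 16 * 16 * 3)   # (batch, video patches, flattened pixels)
audio = torch.randn(2, 128, 16 * 16)       # (batch, audio patches, flattened spectrogram bins)
video_rec, audio_rec, match_logit = model(video, audio)
```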
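
The Experiment Setup row lists the pretraining hyperparameters but not how the two objectives are combined during optimization. The following is a minimal sketch under stated assumptions: the model, data, and loss terms are hypothetical stand-ins, the reported "decay rate of 0.001" is read as weight decay, and torch.optim.Adam with CosineAnnealingLR is only one plausible reading of "Adam optimizer with a cosine schedule".

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Reported values (see the Experiment Setup row above).
LAMBDA_VAM, LAMBDA_MAE = 1.0, 0.3   # loss weights for Eq. (1)
TOTAL_STEPS = 200_000               # pretraining steps
LR, DECAY = 1e-5, 1e-3

# Stand-in model with a matching head and a reconstruction head; not TVLT itself.
class StandInModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.video_proj = nn.Linear(768, dim)
        self.audio_proj = nn.Linear(768, dim)
        self.match_head = nn.Linear(2 * dim, 1)   # vision-audio matching (VAM)
        self.recon_head = nn.Linear(dim, 768)     # masked autoencoding (MAE)

    def forward(self, video, audio):
        v, a = self.video_proj(video), self.audio_proj(audio)
        match_logit = self.match_head(torch.cat([v, a], dim=-1)).squeeze(-1)
        return match_logit, self.recon_head(v + a)

model = StandInModel()
# "Decay rate of 0.001" is interpreted here as weight decay, which is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

for step in range(3):  # a few synthetic steps for illustration; pretraining runs 200k
    video = torch.randn(8, 768)                       # placeholder video features
    audio = torch.randn(8, 768)                       # placeholder audio features
    match_labels = torch.randint(0, 2, (8,)).float()  # matched vs. mismatched pairs
    match_logit, recon = model(video, audio)
    loss_vam = nn.functional.binary_cross_entropy_with_logits(match_logit, match_labels)
    loss_mae = nn.functional.mse_loss(recon, video)   # stand-in reconstruction target
    loss = LAMBDA_VAM * loss_vam + LAMBDA_MAE * loss_mae
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

If the reported decay rate instead refers to learning-rate decay rather than weight decay, the weight_decay argument would be dropped and the schedule configured accordingly.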