TVLT: Textless Vision-Language Transformer

Authors: Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate models on video-based and image-based vision-and-language tasks to compare the learned representation based on audio and text. Table 1 shows that TVLT outperforms the text-based counterpart in audio-to-video retrieval tasks when pretrained on either HowTo100M or YTT-S. We comprehensively analyze the efficiency of our model and show ablation studies over different training variants.
Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal (UNC Chapel Hill); {terran, jmincho, yixin1, mbansal}@cs.unc.edu
Pseudocode | No | The paper describes the model architecture and pretraining objectives through text and diagrams (Figure 2), but it does not include any explicit pseudocode or algorithm blocks. An illustrative sketch of the described objectives is given after this table.
Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/zinengtang/TVLT
Open Datasets | Yes | HowTo100M [52], YTTemporal180M [86], MSR-VTT [81], YouCook2 [90], CrossTask [92], CMU-MOSEI [84], Places-400k (The Places Audio Caption 400K Corpus) [25; 23; 24], VQAv1 [4], and VQAv2 [21].
Dataset Splits | Yes | YouCook2... has 9,586 training clips and 3,350 validation clips. We report the validation split results. CrossTask... has 17,840 training clips and 2,819 validation clips. We report the validation split results.
Hardware Specification | Yes | Pretraining takes 2 weeks with 4 NVIDIA RTX A6000 GPUs (each 48GB memory). We use 2 NVIDIA RTX A6000 GPUs.
Software Dependencies | No | The paper mentions software such as the 'SpeechBrain package', the 'WaveNet Google Cloud Text-to-Speech API', and 'librosa', but does not provide specific version numbers for these components in its experimental setup description.
Experiment Setup | Yes | We train TVLT and the text-based TVLT counterpart for 200k steps using Adam optimizer [33] with a learning rate of 1e-5, batch size 4096, and a decay rate of 0.001 with a cosine schedule [47]. For the pretraining objectives in Eq. (1), we use λ_VAM = 1.0 and λ_MAE = 0.3. Finetuning on Downstream Tasks: We use a learning rate of 1e-5, batch size 256, and a decay rate of 0.001 with a cosine schedule for all tasks.
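
As noted in the Pseudocode row, the paper conveys its architecture and pretraining objectives only through prose and Figure 2. The sketch below is not the authors' implementation: the module names, patch sizes, masking strategy, and dimensions are all assumptions, intended only to make the described flow concrete. The flow is: patch-embed video frames and the audio spectrogram, encode both modalities with a single shared transformer (no text input), reconstruct masked patches for the masked-autoencoding objective, and score matched versus mismatched video-audio pairs for the vision-audio matching objective.

```python
import torch
import torch.nn as nn

class TVLTSketch(nn.Module):
    """Illustrative sketch of the described pipeline; not the released implementation."""

    def __init__(self, dim=768, depth=12, heads=12, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Patch embeddings for video frames and the audio spectrogram (patch sizes are assumptions).
        self.video_patch = nn.Linear(16 * 16 * 3, dim)
        self.audio_patch = nn.Linear(16 * 16, dim)
        # Shared, textless transformer encoder over the concatenated patch tokens.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), depth)
        # Shallow decoder and per-modality heads for masked-autoencoding reconstruction.
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), 2)
        self.video_recon = nn.Linear(dim, 16 * 16 * 3)
        self.audio_recon = nn.Linear(dim, 16 * 16)
        # Binary head for vision-audio matching (VAM).
        self.vam_head = nn.Linear(dim, 1)

    def forward(self, video_patches, audio_patches):
        v = self.video_patch(video_patches)          # (B, Nv, dim)
        a = self.audio_patch(audio_patches)          # (B, Na, dim)
        tokens = torch.cat([v, a], dim=1)
        # Zero out a random subset of patches for the MAE objective; real MAE-style
        # encoders usually drop masked tokens instead of zeroing them.
        keep = (torch.rand(tokens.shape[:2], device=tokens.device) > self.mask_ratio).float()
        enc = self.encoder(tokens * keep.unsqueeze(-1))
        dec = self.decoder(enc)
        video_rec = self.video_recon(dec[:, : v.shape[1]])
        audio_rec = self.audio_recon(dec[:, v.shape[1]:])
        match_logit = self.vam_head(enc.mean(dim=1)).squeeze(-1)
        return video_rec, audio_rec, match_logit

# Tiny usage example with random inputs and a small configuration.
model = TVLTSketch(dim=64, depth=2, heads=4)
video = torch.randn(2, 196, 16 * 16 * 3)   # (batch, video patches, flattened pixels)
audio = torch.randn(2, 128, 16 * 16)       # (batch, audio patches, flattened spectrogram bins)
video_rec, audio_rec, match_logit = model(video, audio)
```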
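
The Experiment Setup row lists the pretraining hyperparameters but not how the two objectives are combined during optimization. The following is a minimal sketch under stated assumptions: the model, data, and loss terms are hypothetical stand-ins, the reported "decay rate of 0.001" is read as weight decay, and torch.optim.Adam with CosineAnnealingLR is only one plausible reading of "Adam optimizer with a cosine schedule".

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

# Reported values (see the Experiment Setup row above).
LAMBDA_VAM, LAMBDA_MAE = 1.0, 0.3   # loss weights for Eq. (1)
TOTAL_STEPS = 200_000               # pretraining steps
LR, DECAY = 1e-5, 1e-3

# Stand-in model with a matching head and a reconstruction head; not TVLT itself.
class StandInModel(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.video_proj = nn.Linear(768, dim)
        self.audio_proj = nn.Linear(768, dim)
        self.match_head = nn.Linear(2 * dim, 1)   # vision-audio matching (VAM)
        self.recon_head = nn.Linear(dim, 768)     # masked autoencoding (MAE)

    def forward(self, video, audio):
        v, a = self.video_proj(video), self.audio_proj(audio)
        match_logit = self.match_head(torch.cat([v, a], dim=-1)).squeeze(-1)
        return match_logit, self.recon_head(v + a)

model = StandInModel()
# "Decay rate of 0.001" is interpreted here as weight decay, which is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=LR, weight_decay=DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_STEPS)

for step in range(3):  # a few synthetic steps for illustration; pretraining runs 200k
    video = torch.randn(8, 768)                       # placeholder video features
    audio = torch.randn(8, 768)                       # placeholder audio features
    match_labels = torch.randint(0, 2, (8,)).float()  # matched vs. mismatched pairs
    match_logit, recon = model(video, audio)
    loss_vam = nn.functional.binary_cross_entropy_with_logits(match_logit, match_labels)
    loss_mae = nn.functional.mse_loss(recon, video)   # stand-in reconstruction target
    loss = LAMBDA_VAM * loss_vam + LAMBDA_MAE * loss_mae
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

If the reported decay rate instead refers to learning-rate decay rather than weight decay, the weight_decay argument would be dropped and the schedule configured accordingly.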