Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
TVLT: Textless Vision-Language Transformer
Authors: Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate models on video-based and image-based vision-and-language tasks to compare the learned representation based on audio and text. Table 1 shows that TVLT outperforms the text-based counterpart in audio-to-video retrieval tasks when pretrained on either HowTo100M or YTT-S. We comprehensively analyze the efficiency of our model and show ablation studies over different training variants. |
| Researcher Affiliation | Academia | Zineng Tang, Jaemin Cho, Yixin Nie, Mohit Bansal; UNC Chapel Hill; EMAIL |
| Pseudocode | No | The paper describes the model architecture and pretraining objectives using text and diagrams (Figure 2), but it does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and checkpoints are available at: https://github.com/zinengtang/TVLT |
| Open Datasets | Yes | HowTo100M [52], YTTemporal180M [86], MSR-VTT [81], Youcook2 [90], CrossTask [92], CMU-MOSEI [84], Places-400k (The Places Audio Caption 400K Corpus) [25; 23; 24], VQAv1 [4], and VQAv2 [21]. |
| Dataset Splits | Yes | Youcook2... has 9,586 training clips and 3,350 validation clips. We report the validation split results. CrossTask... has 17,840 training clips and 2,819 validation clips. We report the validation split results. |
| Hardware Specification | Yes | Pretraining takes 2 weeks with 4 NVIDIA RTX A6000 GPUs (each 49GB memory). We use 2 NVIDIA RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions software such as the 'Speechbrain package', the 'WaveNet Google Cloud Text-to-Speech API', and 'librosa', but does not provide specific version numbers for these components in the experimental setup description. |
| Experiment Setup | Yes | We train TVLT and the text-based TVLT counterpart for 200k steps using Adam optimizer [33] with a learning rate of 1e-5, batch size 4096, and a decay rate of 0.001 with a cosine schedule [47]. For the pretraining objectives in Eq. (1), we use λ_VAM = 1.0 and λ_MAE = 0.3. Finetuning on Downstream Tasks: We use a learning rate of 1e-5, batch size 256, and a decay rate of 0.001 with a cosine schedule for all tasks. |
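
To make the quoted pretraining hyperparameters concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes that Eq. (1) is a weighted sum of the VAM and MAE losses, and that the cosine schedule has no warmup and decays to zero; the excerpt above does not confirm either point. The names `cosine_lr` and `pretraining_loss` are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
PRETRAIN_STEPS = 200_000
BASE_LR = 1e-5
LAMBDA_VAM = 1.0
LAMBDA_MAE = 0.3


def cosine_lr(step, total_steps=PRETRAIN_STEPS, base_lr=BASE_LR, min_lr=0.0):
    """Cosine-annealed learning rate.

    Assumption: no warmup phase and a floor of min_lr (0.0 by default);
    the excerpt only states that a cosine schedule is used.
    """
    # Clamp the step so values outside [0, total_steps] stay well defined.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def pretraining_loss(loss_vam, loss_mae,
                     lambda_vam=LAMBDA_VAM, lambda_mae=LAMBDA_MAE):
    """Weighted sum of the two pretraining objectives.

    Assumption: Eq. (1) combines the losses linearly with these weights.
    """
    return lambda_vam * loss_vam + lambda_mae * loss_mae
```

Under these assumptions, the learning rate starts at 1e-5, reaches half that value at step 100k, and approaches zero at step 200k. The "decay rate of 0.001" in the table is not modeled here, since the excerpt does not say whether it refers to weight decay or something else.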