Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
NeurIPS 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying those gradient harmonization techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to more complicated non-narrative Youtube8M dataset to further improve the state-of-the-arts. (Section 4: Experiments, and performance tables like Table 1) |
| Researcher Affiliation | Collaboration | Junru Wu Texas A&M University EMAIL; Yi Liang Google Research EMAIL; Feng Han Google Research EMAIL; Hassan Akbari Google Research EMAIL; Zhangyang Wang University of Texas at Austin EMAIL; Cong Yu Celonis Inc. / Celo AI EMAIL |
| Pseudocode | Yes | Algorithm 1 Cross-Modality Gradient Realignment; Algorithm 2 Gradient-based Curriculum Learning |
| Open Source Code | No | The paper does not include any statement about releasing source code or provide a link to a code repository for its methodology. |
| Open Datasets | Yes | HowTo100M [12] is a large-scale dataset of narrated videos... AudioSet [24] is a large-scale audio-visual dataset... Youtube8M [1] is a large-scale video classification dataset... |
| Dataset Splits | No | The paper mentions using subsets of datasets and sampling clips but does not provide specific numerical train/validation/test splits (e.g., percentages or exact counts) to reproduce the data partitioning. |
| Hardware Specification | Yes | Our framework is implemented in Tensorflow 2.8, and train with 256 TPUV3s, it took a total of 3 days to train our models. |
| Software Dependencies | Yes | Our framework is implemented in Tensorflow 2.8 |
| Experiment Setup | Yes | Pre-training Hyperparameter: We strictly follow the setting in [11], pre-training VATT from scratch with Adam optimizer with an initial learning rate of 1e-4, 10k warmup steps, 500k steps in total, a batch size of 2048 and using a cosine learning rate scheduler to anneal the learning rate from 1e-4 to 5e-5. |
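
The learning-rate schedule quoted in the Experiment Setup row can be sketched in a few lines. This is a minimal illustration, not the authors' code: the numeric values come from the quoted text, but the linear shape of the warmup, the choice to start it at zero, and the function name `learning_rate` are assumptions made here for the example.

```python
import math

# Values quoted in the Experiment Setup row.
PEAK_LR = 1e-4          # initial learning rate
FINAL_LR = 5e-5         # value the cosine schedule anneals to
WARMUP_STEPS = 10_000   # warmup steps
TOTAL_STEPS = 500_000   # total training steps


def learning_rate(step: int) -> float:
    """Return the learning rate at a given 0-indexed training step."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to PEAK_LR (shape not stated in the source).
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the post-warmup phase completed, clamped to at most 1.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    # Cosine factor falls from 1 to 0 as progress goes from 0 to 1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine
```

Under these assumptions the rate rises to 1e-4 at step 10k, then decays along a half cosine and reaches 5e-5 at step 500k. The batch size of 2048 and the Adam optimizer from the same row are not modeled here.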