Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization

Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Applying those gradient harmonization techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube-8M dataset to further improve the state of the art. (Section 4: Experiments, and performance tables such as Table 1)
Researcher Affiliation | Collaboration | Junru Wu (Texas A&M University, sandboxmaster@tamu.edu); Yi Liang (Google Research, yiliang@google.com); Feng Han (Google Research, bladehan@google.com); Hassan Akbari (Google Research, hassanak@google.com); Zhangyang Wang (University of Texas at Austin, atlaswang@utexas.edu); Cong Yu (Celonis Inc. / Celo AI, cong.yu@celonis.com)
Pseudocode | Yes | Algorithm 1: Cross-Modality Gradient Realignment; Algorithm 2: Gradient-based Curriculum Learning (an illustrative sketch of the realignment idea follows the table)
Open Source Code | No | The paper does not include any statement about releasing source code, nor does it provide a link to a code repository for its methodology.
Open Datasets | Yes | HowTo100M [12] is a large-scale dataset of narrated videos... AudioSet [24] is a large-scale audio-visual dataset... YouTube-8M [1] is a large-scale video classification dataset...
Dataset Splits | No | The paper mentions using subsets of the datasets and sampling clips, but it does not provide specific numerical train/validation/test splits (e.g., percentages or exact counts) needed to reproduce the data partitioning.
Hardware Specification | Yes | Our framework is implemented in TensorFlow 2.8 and trained with 256 TPUv3s; it took a total of 3 days to train our models.
Software Dependencies | Yes | Our framework is implemented in TensorFlow 2.8.
Experiment Setup | Yes | Pre-training hyperparameters: We strictly follow the settings in [11], pre-training VATT from scratch with the Adam optimizer, an initial learning rate of 1e-4, 10k warmup steps, 500k steps in total, a batch size of 2048, and a cosine learning rate scheduler that anneals the learning rate from 1e-4 to 5e-5. (A sketch of this schedule follows the table.)
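The paper's Algorithm 1 (Cross-Modality Gradient Realignment) is available only as pseudocode in the paper; as a rough illustration of what a pairwise gradient-conflict correction can look like, the sketch below applies a PCGrad-style projection between the gradients of two modality-pair objectives (e.g., video-audio and video-text). The function name realign_gradients, the two-gradient-list interface, and the projection rule itself are assumptions made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch only: a PCGrad-style pairwise projection, assumed here as
# a stand-in for the paper's Cross-Modality Gradient Realignment (Algorithm 1).
import tensorflow as tf


def realign_gradients(grads_a, grads_b):
    """Project each gradient in grads_a away from its counterpart in grads_b
    whenever the two conflict (negative inner product).

    grads_a, grads_b: per-variable gradient lists of the same length/shapes,
    e.g. from the video-audio and video-text contrastive losses.
    Returns a realigned copy of grads_a; grads_b is left unchanged.
    """
    realigned = []
    for g_a, g_b in zip(grads_a, grads_b):
        if g_a is None or g_b is None:
            realigned.append(g_a)
            continue
        dot = tf.reduce_sum(g_a * g_b)
        # Only intervene when the cross-modality gradients point in
        # conflicting directions; otherwise keep g_a as-is.
        proj = tf.cond(
            dot < 0.0,
            lambda ga=g_a, gb=g_b, d=dot: ga - d / (tf.reduce_sum(gb * gb) + 1e-12) * gb,
            lambda ga=g_a: ga,
        )
        realigned.append(proj)
    return realigned
```

In a training step, one would compute per-objective gradients with separate tf.GradientTape passes, realign one list against the other, sum the results variable by variable, and pass them to the optimizer via apply_gradients.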
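The quoted setup (Adam, initial learning rate 1e-4, 10k warmup steps, 500k total steps, cosine annealing from 1e-4 to 5e-5) can be written as a single TensorFlow learning-rate schedule. The sketch below is one plausible realization under those numbers; the class name WarmupCosine and the choice of linear warmup are illustrative assumptions, not the authors' released configuration.

```python
# Minimal sketch of the quoted schedule: linear warmup to 1e-4 over the first
# 10k steps, then cosine annealing from 1e-4 down to 5e-5 by step 500k.
import math

import tensorflow as tf


class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr=1e-4, final_lr=5e-5,
                 warmup_steps=10_000, total_steps=500_000):
        self.peak_lr = peak_lr
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = tf.cast(self.warmup_steps, tf.float32)
        total = tf.cast(self.total_steps, tf.float32)
        # Linear ramp from 0 to the peak learning rate during warmup.
        warmup_lr = self.peak_lr * step / warmup
        # Cosine decay from peak_lr to final_lr over the remaining steps.
        progress = tf.clip_by_value((step - warmup) / (total - warmup), 0.0, 1.0)
        cosine_lr = self.final_lr + 0.5 * (self.peak_lr - self.final_lr) * (
            1.0 + tf.cos(math.pi * progress))
        return tf.where(step < warmup, warmup_lr, cosine_lr)


# The batch size of 2048 quoted in the setup belongs to the input pipeline;
# the optimizer itself only needs the schedule.
optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupCosine())
```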