Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Authors: Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applying these gradient harmonization techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative Youtube8M dataset to further improve the state of the art. (Section 4: Experiments, and performance tables such as Table 1) |
| Researcher Affiliation | Collaboration | Junru Wu Texas A&M University sandboxmaster@tamu.edu; Yi Liang Google Research yiliang@google.com; Feng Han Google Research bladehan@google.com; Hassan Akbari Google Research hassanak@google.com; Zhangyang Wang University of Texas at Austin atlaswang@utexas.edu; Cong Yu Celonis Inc. / Celo AI cong.yu@celonis.com |
| Pseudocode | Yes | Algorithm 1 Cross-Modality Gradient Realignment; Algorithm 2 Gradient-based Curriculum Learning (a hedged sketch of the realignment idea appears after this table) |
| Open Source Code | No | The paper does not include any statement about releasing source code or provide a link to a code repository for its methodology. |
| Open Datasets | Yes | HowTo100M [12] is a large-scale dataset of narrated videos... AudioSet [24] is a large-scale audio-visual dataset... Youtube8M [1] is a large-scale video classification dataset... |
| Dataset Splits | No | The paper mentions using subsets of datasets and sampling clips but does not provide specific numerical train/validation/test splits (e.g., percentages or exact counts) to reproduce the data partitioning. |
| Hardware Specification | Yes | Our framework is implemented in TensorFlow 2.8 and trained with 256 TPUv3s; training took a total of 3 days. |
| Software Dependencies | Yes | Our framework is implemented in TensorFlow 2.8 |
| Experiment Setup | Yes | Pre-training hyperparameters: We strictly follow the settings in [11], pre-training VATT from scratch with the Adam optimizer, an initial learning rate of 1e-4, 10k warmup steps, 500k steps in total, a batch size of 2048, and a cosine learning rate scheduler that anneals the learning rate from 1e-4 to 5e-5. (A hedged sketch of this schedule follows the table.) |
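
The "Pseudocode" row names Algorithm 1, Cross-Modality Gradient Realignment, but the table does not reproduce it. Below is a minimal sketch of one plausible reading, assuming a PCGrad-style projection that removes the conflicting component of one modality's gradient when it points against another's; the function name `realign_gradients` and the projection rule are illustrative assumptions, not the paper's exact algorithm.

```python
# A PCGrad-style sketch (assumed, not the paper's exact Algorithm 1):
# when the gradients of two modality losses conflict (negative inner
# product), project one gradient onto the normal plane of the other
# before the two are combined for the shared backbone update.
import tensorflow as tf


def realign_gradients(grads_a, grads_b):
    """Return grads_a with any component that conflicts with grads_b removed.

    grads_a / grads_b are lists of per-variable gradients over the same
    shared variables, computed from two different modality losses.
    Assumes eager execution (the default in TF 2.x).
    """
    flat_a = tf.concat([tf.reshape(g, [-1]) for g in grads_a], axis=0)
    flat_b = tf.concat([tf.reshape(g, [-1]) for g in grads_b], axis=0)
    dot = tf.tensordot(flat_a, flat_b, axes=1)
    if dot < 0.0:  # gradients point in conflicting directions
        scale = dot / (tf.reduce_sum(tf.square(flat_b)) + 1e-12)
        grads_a = [ga - scale * gb for ga, gb in zip(grads_a, grads_b)]
    return grads_a


# Example: harmonize video-audio and video-text contrastive gradients
# symmetrically before applying a single update to the shared backbone.
# harmonized = [a + b for a, b in zip(realign_gradients(g_va, g_vt),
#                                     realign_gradients(g_vt, g_va))]
```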
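
The quoted pre-training setup (Adam, 1e-4 initial learning rate, 10k warmup steps, 500k total steps, cosine anneal to 5e-5) can be expressed as a small custom schedule in TensorFlow 2.8, which does not ship a built-in warmup-plus-cosine option. The linear shape of the warmup is an assumption; the numeric constants are the ones quoted in the "Experiment Setup" row.

```python
# A minimal sketch of the quoted schedule: linear warmup to 1e-4 over
# 10k steps, then a cosine anneal from 1e-4 down to 5e-5 by step 500k,
# driving Adam (batch size 2048 in the paper's setup). The linear
# warmup shape is assumed; the constants come from the row above.
import math
import tensorflow as tf


class WarmupCosine(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, peak_lr=1e-4, final_lr=5e-5,
                 warmup_steps=10_000, total_steps=500_000):
        self.peak_lr = peak_lr
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        # Linear ramp from 0 to peak_lr during warmup.
        warmup_lr = self.peak_lr * step / self.warmup_steps
        # Cosine decay from peak_lr to final_lr over the remaining steps.
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        progress = tf.clip_by_value(progress, 0.0, 1.0)
        cosine_lr = self.final_lr + 0.5 * (self.peak_lr - self.final_lr) * (
            1.0 + tf.cos(math.pi * progress))
        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)


optimizer = tf.keras.optimizers.Adam(learning_rate=WarmupCosine())
```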