Auto-Encoding Morph-Tokens for Multimodal LLM

Authors: Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, Shuicheng Yan, Tat-Seng Chua, Yueting Zhuang, Hanwang Zhang

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
Researcher Affiliation | Collaboration | Zhejiang University; National University of Singapore; Skywork AI; Nanyang Technological University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/MorphTokens.
Open Datasets | Yes | We use 30M image-text pairs from CC3M (Sharma et al., 2018) and Laion (Christoph et al., 2022).
Dataset Splits | Yes | MS-COCO (Lin et al., 2014) (with 30K randomly sampled data from the validation set and 5K data from the Karpathy test set) and Flickr30K (Young et al., 2014) (with 1K data in the test set). (A split-assembly sketch follows the table.)
Hardware Specification | Yes | The training is conducted on 16x A800 GPUs.
Software Dependencies | No | The paper mentions software components such as the 'AdamW optimizer', 'cosine learning rate scheduler', 'Vicuna', and 'LoRA', but does not provide specific version numbers for these or for other key software dependencies such as programming languages or deep learning frameworks.
Experiment Setup | Yes | In particular, we set the token lengths |M| = 32 and |Y| = 512. Recall that although this stage requires equality between pre- and post-MLLM morph-tokens, there is no conflict due to the absence of a visual generation objective. The resultant vocabulary size of our MLLM is 8,192 morph-tokens and 32,000 text-tokens. The hyperparameters for the AdamW optimizer are set with β = (0.9, 0.999), and we apply a weight decay of 0.05. The training is conducted on 16x A800 GPUs. For the first two stages, we train for 200,000 steps with a maximum learning rate of 1e-4. During instruction tuning, the model is trained for 100,000 steps with a maximum learning rate of 1e-5. (An optimizer-configuration sketch based on these values follows the table.)
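
As a reference for the dataset-splits row above, the following is a minimal Python sketch of how the reported evaluation data (30K random MS-COCO validation captions, the 5K Karpathy test images, and the 1K Flickr30K test images) might be assembled. The file paths, the fixed random seed, and the use of Karpathy-style split files are assumptions for illustration; the paper does not publish its sampling code, and the actual 30K subset it evaluates on may differ.

    import json
    import random

    # Hypothetical paths; the paper does not specify file locations.
    COCO_VAL_ANN = "annotations/captions_val2014.json"   # MS-COCO validation captions
    KARPATHY_SPLIT = "dataset_coco.json"                  # Karpathy split file (contains the 5K test images)
    FLICKR30K_SPLIT = "dataset_flickr30k.json"            # Karpathy-style Flickr30K split (1K test images)

    def sample_coco_val(n=30_000, seed=0):
        """Randomly sample n caption annotations from the MS-COCO validation set."""
        with open(COCO_VAL_ANN) as f:
            annotations = json.load(f)["annotations"]
        rng = random.Random(seed)  # fixed seed is an assumption made here for reproducibility
        return rng.sample(annotations, n)

    def karpathy_test(path):
        """Return the images marked 'test' in a Karpathy-style split file."""
        with open(path) as f:
            images = json.load(f)["images"]
        return [img for img in images if img["split"] == "test"]

    if __name__ == "__main__":
        coco_30k = sample_coco_val()                 # 30K randomly sampled MS-COCO val captions
        coco_5k = karpathy_test(KARPATHY_SPLIT)      # 5K MS-COCO Karpathy test images
        flickr_1k = karpathy_test(FLICKR30K_SPLIT)   # 1K Flickr30K test images
        print(len(coco_30k), len(coco_5k), len(flickr_1k))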
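
For the experiment-setup row, the following is a minimal PyTorch sketch of the quoted optimizer and schedule settings: AdamW with β = (0.9, 0.999) and weight decay 0.05, a cosine learning-rate schedule, 200,000 steps at a maximum learning rate of 1e-4 for the first two stages, and 100,000 steps at 1e-5 for instruction tuning. The placeholder model, the stage names, and the absence of warm-up are assumptions, not details confirmed by the paper.

    import torch
    from torch.optim import AdamW
    from torch.optim.lr_scheduler import CosineAnnealingLR

    # Hyperparameters quoted in the paper's experiment setup.
    BETAS = (0.9, 0.999)
    WEIGHT_DECAY = 0.05
    STAGES = {
        "pretrain":    {"steps": 200_000, "max_lr": 1e-4},  # first two stages
        "instruction": {"steps": 100_000, "max_lr": 1e-5},  # instruction tuning
    }

    def build_optimizer(model: torch.nn.Module, stage: str):
        """Build AdamW + cosine schedule for one training stage (a sketch, not the authors' code)."""
        cfg = STAGES[stage]
        optimizer = AdamW(
            model.parameters(),
            lr=cfg["max_lr"],
            betas=BETAS,
            weight_decay=WEIGHT_DECAY,
        )
        # Cosine decay over the full stage; warm-up is omitted because the quoted
        # text does not specify one.
        scheduler = CosineAnnealingLR(optimizer, T_max=cfg["steps"])
        return optimizer, scheduler

    if __name__ == "__main__":
        toy_model = torch.nn.Linear(8, 8)  # placeholder model for illustration only
        opt, sched = build_optimizer(toy_model, "pretrain")
        print(opt.defaults["lr"], sched.T_max)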