Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Auto-Encoding Morph-Tokens for Multimodal LLM
Authors: Kaihang Pan, Siliang Tang, Juncheng Li, Zhaoyu Fan, Wei Chow, Shuicheng Yan, Tat-Seng Chua, Yueting Zhuang, Hanwang Zhang
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously. |
| Researcher Affiliation | Collaboration | 1Zhejiang University 2National University of Singapore 3Skywork AI 4Nanyang Technological University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our project is available at https://github.com/DCDmllm/MorphTokens. |
| Open Datasets | Yes | We use 30M image-text pairs from CC3M (Sharma et al., 2018) and Laion (Christoph et al., 2022) |
| Dataset Splits | Yes | MS-COCO (Lin et al., 2014), (with 30K randomly sampled data from the validation set and 5K data from the Karpathy test set), and Flickr30K (Young et al., 2014) (with 1K data in the test set) |
| Hardware Specification | Yes | The training is conducted on 16x A800 GPUs. |
| Software Dependencies | No | The paper mentions software components like 'AdamW optimizer', 'cosine learning rate scheduler', 'Vicuna', and 'LoRA' but does not provide specific version numbers for these or other key software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | In particular, we set the token length: |M| = 32 and |Y| = 512. Recall that although this stage requires equality between pre- and post-MLLM morph-tokens, there is no conflict due to the absence of a visual generation objective. The resultant vocabulary size of our MLLM is 8,192 morph-tokens and 32,000 text-tokens. The hyperparameters for the AdamW optimizer are set with β = (0.9, 0.999), and we apply a weight decay of 0.05. The training is conducted on 16x A800 GPUs. For the first two stages, we train for 200,000 steps with a maximum learning rate of 1e-4. During instruction tuning, the model is trained for 100,000 steps with a maximum learning rate of 1e-5. |
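The quoted setup pairs AdamW with a cosine learning-rate scheduler (200,000 steps at a maximum LR of 1e-4 for the first two stages; 100,000 steps at 1e-5 for instruction tuning). As a rough illustration of what that schedule looks like, here is a minimal stdlib-only sketch; the warmup option, minimum LR of 0, and the function itself are assumptions, since the paper only names a "cosine learning rate scheduler":

```python
import math

def cosine_lr(step, total_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Illustrative cosine learning-rate schedule (not the authors' code).

    Decays from max_lr to min_lr over total_steps following a half-cosine,
    with an optional linear warmup (an assumption; the paper does not
    describe warmup).
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Quoted hyperparameters: stages 1-2 run 200,000 steps at max LR 1e-4;
# instruction tuning runs 100,000 steps at max LR 1e-5.
stage12_peak = cosine_lr(0, 200_000, 1e-4)        # starts at the maximum LR
stage12_end = cosine_lr(200_000, 200_000, 1e-4)   # decays to (near) zero
```

This is only a sketch of the named scheduler family; in practice a framework-provided scheduler (e.g. a cosine-annealing scheduler in a deep learning library) would typically be used.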