NExT-GPT: Any-to-Any Multimodal LLM

Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiments, we aim to quantify the performance of NExT-GPT on a range of downstream tasks requiring perceiving and generating any modalities. Also, due to the space limitation, we present a number of additional experimental results and analyses in Appendix D. Firstly, we evaluate the semantic understanding capability of NExT-GPT w.r.t. image, video, or audio, across multiple benchmarks. The results are shown in Tables 2 and 3.
Researcher Affiliation | Academia | Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua (NExT++ Research Center, National University of Singapore, Singapore). Correspondence to: Hao Fei <haofei37@nus.edu.sg>.
Pseudocode | No | The paper describes the architecture and training procedures using text and diagrams (Figures 1, 2, and 3), but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Project website: https://next-gpt.github.io/
Open Datasets | Yes | Specifically, we utilize three types of X-caption pair data: 1) Video-caption pair dataset: WebVid-2M (Bain et al., 2021), a large-scale dataset of short videos with textual descriptions sourced from stock footage sites; 2) Image-caption pair dataset: CC3M (Sharma et al., 2018), which contains over 3 million images accompanied by diverse styles of natural-language descriptions; and 3) Audio-caption pair dataset: AudioCaps (Kim et al., 2019), an extensive dataset of approximately 46k audio clips paired with human-written textual descriptions collected via crowdsourcing. Furthermore, we construct a modality-switching IT dataset with 5k instances, named MosIT.
Dataset Splits | No | The paper mentions using various datasets and conducting fine-tuning, but it does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages or sample counts for each split).
Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running the experiments (e.g., specific GPU models, CPU types, or memory configurations).
Software Dependencies | No | The paper mentions various models and optimizers, such as Vicuna (7B-v0), Stable Diffusion (SD-v1.5), and the Adam optimizer, but it does not specify version numbers for general software dependencies such as the programming language (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific libraries.
Experiment Setup | Yes | In Table 7, we list the detailed hyper-parameter settings at the three stages. ... Learning Rate 0.0004, Weight Decay 0.001, Training Epochs 1, Warmup Ratio 0.1, Batch Size Per GPU 18 (Stage-1), 8 (Stage-2), 4 (Stage-3), Maximum Token Length 512.
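
To make the quoted Table 7 settings concrete, the sketch below expresses them as a training configuration in PyTorch. This is a minimal illustration, not the authors' code: the Adam optimizer follows the paper's mention of Adam, but the linear-warmup schedule, the `build_optimizer_and_scheduler` helper, and the `model` argument are assumptions, since the paper reports only the hyperparameter values themselves.

```python
# Minimal sketch of the reported Table 7 hyperparameters (illustrative, not the authors' code).
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Per-GPU batch sizes reported for the three training stages.
STAGE_BATCH_SIZE_PER_GPU = {"stage-1": 18, "stage-2": 8, "stage-3": 4}

config = {
    "learning_rate": 4e-4,      # Learning Rate 0.0004
    "weight_decay": 0.001,      # Weight Decay 0.001
    "training_epochs": 1,       # Training Epochs 1
    "warmup_ratio": 0.1,        # Warmup Ratio 0.1
    "max_token_length": 512,    # Maximum Token Length 512
}

def build_optimizer_and_scheduler(model: torch.nn.Module, total_steps: int):
    """Adam with warmup over the first 10% of steps (warmup shape is an assumption)."""
    optimizer = Adam(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
    warmup_steps = int(config["warmup_ratio"] * total_steps)

    def lr_lambda(step: int) -> float:
        # Linear warmup, then constant; the paper does not state the decay shape.
        return min(1.0, (step + 1) / max(1, warmup_steps))

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```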