NExT-GPT: Any-to-Any Multimodal LLM
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we aim to quantify the performance of NExT-GPT on a range of downstream tasks requiring perceiving and generating any modalities. Also, due to the space limitation, we present a number of further experimental results and analyses in Appendix D. Firstly, we evaluate the semantic understanding capability of NExT-GPT w.r.t. image, video, or audio, across multiple benchmarks. The results are shown in Tables 2 and 3. |
| Researcher Affiliation | Academia | Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua (NExT++ Research Center, National University of Singapore, Singapore). Correspondence to: Hao Fei <haofei37@nus.edu.sg>. |
| Pseudocode | No | The paper describes the architecture and training procedures using text and diagrams (Figure 1, 2, 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project website: https://next-gpt.github.io/ |
| Open Datasets | Yes | Specifically, we utilize three types of X-caption pair data, including 1) Video-caption pair dataset: Webvid-2M (Bain et al., 2021), a large-scale dataset of short videos with textual descriptions sourced from stock footage sites, 2) Image-caption pair dataset: CC3M (Sharma et al., 2018), which contains over 3 million images accompanied by diverse styles of natural-language descriptions, and 3) Audio-caption pair dataset: AudioCaps (Kim et al., 2019), an extensive dataset of approximately 46k audio clips paired with human-written textual descriptions collected via crowdsourcing. Furthermore, we construct a modality-switching IT dataset with 5k instances, named MosIT. (See the data-record sketch after the table.) |
| Dataset Splits | No | The paper mentions using various datasets and conducting fine-tuning, but it does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages or sample counts for each split) within the paper text. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running experiments (e.g., specific GPU models, CPU types, or memory configurations). |
| Software Dependencies | No | The paper mentions various models and optimizers such as Vicuna (7B-v0), Stable Diffusion (SD-v1.5), and the Adam optimizer, but it does not specify version numbers for general software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions. |
| Experiment Setup | Yes | In Table 7, we list the detailed hyper-parameter settings at the three stages. ... Learning Rate 0.0004, Weight Decay 0.001, Training Epochs 1, Warmup Ratio 0.1, Batch Size Per GPU 18 (Stage-1), 8 (Stage-2), 4 (Stage-3), Maximum Token Length 512. (A configuration sketch follows the table.) |
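The Open Datasets row lists three X-caption pair corpora (Webvid-2M, CC3M, AudioCaps) plus the constructed MosIT instruction data. As a minimal sketch of how such paired records could be represented for reproduction, the snippet below defines a unified media-caption record; the field names, the tab-separated index format, and the `load_pairs` helper are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

Modality = Literal["image", "video", "audio"]

@dataclass
class XCaptionPair:
    """One media-caption training pair (hypothetical unified record)."""
    modality: Modality
    media_path: Path   # path to the image / video clip / audio clip
    caption: str       # paired natural-language description

def load_pairs(index_file: Path, modality: Modality) -> list[XCaptionPair]:
    """Read a tab-separated index of (media_path, caption) rows (assumed format)."""
    pairs = []
    for line in index_file.read_text(encoding="utf-8").splitlines():
        media, caption = line.split("\t", maxsplit=1)
        pairs.append(XCaptionPair(modality, Path(media), caption))
    return pairs
```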
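The Experiment Setup row quotes the Table 7 hyper-parameters (Adam, learning rate 0.0004, weight decay 0.001, 1 epoch, warmup ratio 0.1, per-GPU batch sizes 18/8/4 for the three stages, maximum token length 512). Below is a minimal sketch of that configuration in PyTorch; the AdamW variant and the linear warmup-then-decay schedule are assumptions, since the paper only names Adam and the scalar settings.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Scalars quoted from Table 7 of the paper.
PER_GPU_BATCH_SIZE = {"stage1": 18, "stage2": 8, "stage3": 4}
MAX_TOKEN_LENGTH = 512
TRAINING_EPOCHS = 1

def build_optimizer(model: torch.nn.Module, num_training_steps: int):
    # AdamW and the linear-warmup schedule are assumptions; the paper reports
    # Adam with lr 4e-4, weight decay 1e-3, and a warmup ratio of 0.1.
    optimizer = AdamW(model.parameters(), lr=4e-4, weight_decay=1e-3)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio 0.1
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```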