NExT-GPT: Any-to-Any Multimodal LLM
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the experiments, we aim to quantify the performance of NExT-GPT on a range of downstream tasks requiring perceiving and generating any modalities. Also, due to the space limitation, we present a number of further experimental results and analyses in Appendix D. Firstly, we evaluate the semantic understanding capability of NExT-GPT w.r.t. image, video, or audio, across multiple benchmarks. The results are shown in Tables 2 and 3. |
| Researcher Affiliation | Academia | Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua (NExT++ Research Center, National University of Singapore, Singapore). Correspondence to: Hao Fei <haofei37@nus.edu.sg>. |
| Pseudocode | No | The paper describes the architecture and training procedures using text and diagrams (Figure 1, 2, 3), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project website: https://next-gpt.github.io/ |
| Open Datasets | Yes | Specifically, we utilize three types of X-caption pair data, including 1) Video-caption pair dataset: Webvid-2M (Bain et al., 2021), a large-scale dataset of short videos with textual descriptions sourced from stock footage sites, 2) Image-caption pair dataset: CC3M (Sharma et al., 2018), which contains over 3 million images accompanied by diverse styles of natural-language descriptions, and 3) Audio-caption pair dataset: AudioCaps (Kim et al., 2019), an extensive dataset of approximately 46k audio clips paired with human-written textual descriptions collected via crowdsourcing. Furthermore, we construct a modality-switching IT dataset with 5k instances, named MosIT. (See the data-record sketch after the table.) |
| Dataset Splits | No | The paper mentions using various datasets and conducting fine-tuning, but it does not explicitly provide details about specific training, validation, or test dataset splits (e.g., percentages or sample counts for each split) within the paper text. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware used for running experiments (e.g., specific GPU models, CPU types, or memory configurations). |
| Software Dependencies | No | The paper mentions various models and optimizers such as Vicuna (7B-v0), Stable Diffusion (SD-v1.5), and the Adam optimizer, but it does not specify version numbers for general software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or specific library versions. |
| Experiment Setup | Yes | In Table 7, we list the detailed hyper-parameter settings at the three stages. ... Learning Rate 0.0004, Weight Decay 0.001, Training Epochs 1, Warmup Ratio 0.1, Batch Size Per GPU 18 (Stage-1), 8 (Stage-2), 4 (Stage-3), Maximum Token Length 512. (A configuration sketch follows the table.) |
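The Open Datasets row lists three X-caption pair corpora (Webvid-2M, CC3M, AudioCaps) plus the constructed MosIT instruction data. As a minimal sketch of how such paired records could be represented for reproduction, the snippet below defines a unified media-caption record; the field names, the tab-separated index format, and the `load_pairs` helper are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Literal

Modality = Literal["image", "video", "audio"]

@dataclass
class XCaptionPair:
    """One media-caption training pair (hypothetical unified record)."""
    modality: Modality
    media_path: Path   # path to the image / video clip / audio clip
    caption: str       # paired natural-language description

def load_pairs(index_file: Path, modality: Modality) -> list[XCaptionPair]:
    """Read a tab-separated index of (media_path, caption) rows (assumed format)."""
    pairs = []
    for line in index_file.read_text(encoding="utf-8").splitlines():
        media, caption = line.split("\t", maxsplit=1)
        pairs.append(XCaptionPair(modality, Path(media), caption))
    return pairs
```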
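The Experiment Setup row quotes the Table 7 hyper-parameters (Adam, learning rate 0.0004, weight decay 0.001, 1 epoch, warmup ratio 0.1, per-GPU batch sizes 18/8/4 for the three stages, maximum token length 512). Below is a minimal sketch of that configuration in PyTorch; the AdamW variant and the linear warmup-then-decay schedule are assumptions, since the paper only names Adam and the scalar settings.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Scalars quoted from Table 7 of the paper.
PER_GPU_BATCH_SIZE = {"stage1": 18, "stage2": 8, "stage3": 4}
MAX_TOKEN_LENGTH = 512
TRAINING_EPOCHS = 1

def build_optimizer(model: torch.nn.Module, num_training_steps: int):
    # AdamW and the linear-warmup schedule are assumptions; the paper reports
    # Adam with lr 4e-4, weight decay 1e-3, and a warmup ratio of 0.1.
    optimizer = AdamW(model.parameters(), lr=4e-4, weight_decay=1e-3)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio 0.1
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```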