Unified Generative and Discriminative Training for Multi-modal Large Language Models

Authors: Wei Chow, Juncheng Li, Qifan Yu, Kaihang Pan, Hao Fei, Zhiqi Ge, Shuaiyang, Siliang Tang, Hanwang Zhang, Qianru Sun

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses prior discriminative approaches on interleaved and fine-grained retrieval benchmarks.
Researcher Affiliation | Academia | Zhejiang University; National University of Singapore; Nanyang Technological University; Singapore Management University
Pseudocode | No | The paper describes its methods using text, diagrams, and mathematical formulas but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The project repository is here.
Open Datasets | Yes | Our vision-language task datasets are a subset of VILA [52], including MMC4 [104], COYO [9], and the LLaVA-1.5 SFT dataset [55].
Dataset Splits | Yes | The split of test sets and the evaluation metrics are aligned with those described in VILA [52] and LLaVA [55]. We evaluated the performance of Sugar on the widely adopted MSCOCO [38] dataset in the context of a standard image-text retrieval task. Sugar demonstrated comparable performance to FROMAGe [40] in R@1 and surpassed it in R@5 and R@10, highlighting Sugar's superiority in normal retrieval tasks. What's more, we then utilize FAISS [37], a library for efficient similarity search in dense vector spaces, to index and retrieve candidates (see the retrieval sketch after the table). Therefore, the results may exhibit slight differences when compared under identical settings. The results in Table 3(a) are provided for reference only.
Hardware Specification | Yes | Training is conducted on 8× A800 GPUs for approximately 12 hours.
Software Dependencies | No | The paper mentions using CLIP ViT-L/14 and Vicuna 1.5 models, LoRA tuning, and the AdamW optimizer, but does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | In our implementation, we set the LoRA rank r = 128 and α = 256. We utilize the AdamW optimizer [60] in conjunction with a cosine learning rate scheduler. The hyperparameters for the AdamW optimizer are configured with a warm-up ratio of 0.03 and a maximum learning rate of 1e-4 (see the training-configuration sketch after the table).
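
As a reference for the retrieval setup quoted in the Dataset Splits row, below is a minimal sketch of how FAISS is commonly used to index dense embeddings and fetch top-k candidates for R@1/R@5/R@10 evaluation. The embedding dimension, array names, and random data are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of FAISS-based candidate retrieval (assumptions noted inline).
import faiss
import numpy as np

dim = 768  # assumed embedding dimension
gallery_embs = np.random.rand(10_000, dim).astype("float32")  # hypothetical gallery embeddings
query_embs = np.random.rand(5, dim).astype("float32")         # hypothetical query embeddings

# Normalize so that inner-product search corresponds to cosine similarity.
faiss.normalize_L2(gallery_embs)
faiss.normalize_L2(query_embs)

# Exact inner-product index over the gallery embeddings.
index = faiss.IndexFlatIP(dim)
index.add(gallery_embs)

# Retrieve the top-10 candidates per query; `ids` indexes into the gallery,
# which is what R@1 / R@5 / R@10 would be computed from.
scores, ids = index.search(query_embs, 10)
```

An exact (flat) index is shown for simplicity; approximate indexes such as IVF or HNSW trade a small amount of recall for speed on larger galleries.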
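The Experiment Setup row can also be read as a configuration sketch, shown here with Hugging Face PEFT and Transformers. Only the LoRA rank and alpha, the AdamW optimizer, the cosine scheduler, the warm-up ratio, and the peak learning rate come from the quoted setup; the target modules, dropout, batch size, epoch count, output path, and precision flag are assumptions for illustration, and the paper's actual training code may differ.

```python
# Hedged sketch of the reported fine-tuning hyperparameters (assumptions noted inline).
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=128,                                 # LoRA rank reported in the paper
    lora_alpha=256,                        # LoRA alpha reported in the paper
    target_modules=["q_proj", "v_proj"],   # assumed; the paper excerpt does not list modules
    lora_dropout=0.05,                     # assumed default
)

training_args = TrainingArguments(
    output_dir="./sugar-finetune",         # hypothetical path
    optim="adamw_torch",                   # AdamW optimizer [60]
    learning_rate=1e-4,                    # maximum learning rate
    lr_scheduler_type="cosine",            # cosine learning-rate scheduler
    warmup_ratio=0.03,                     # warm-up ratio
    per_device_train_batch_size=8,         # assumed
    num_train_epochs=1,                    # assumed
    bf16=True,                             # assumed mixed-precision setting
)
```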