Unified Generative and Discriminative Training for Multi-modal Large Language Models
Authors: Wei Chow, Juncheng Li, Qifan Yu, Kaihang Pan, Hao Fei, Zhiqi Ge, Shuaiyang, Siliang Tang, Hanwang Zhang, Qianru Sun
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art results in multiple generative tasks, especially those requiring cognitive and discrimination abilities. Additionally, our method surpasses discriminative benchmarks in interleaved and fine-grained retrieval tasks. |
| Researcher Affiliation | Academia | 1Zhejiang University 2National University of Singapore 3Nanyang Technological University 4Singapore Management University |
| Pseudocode | No | The paper describes its methods using text, diagrams, and mathematical formulas but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The project repository is here. |
| Open Datasets | Yes | Our vision-language task datasets are a subset of VILA [52], including MMC4 [104], COYO [9], LLaVA-1.5 SFT dataset [55]. |
| Dataset Splits | Yes | The split of test sets and the evaluation metrics are aligned with those described in VILA [52] and LLaVA [55]. We evaluated the performance of Sugar on the widely adopted MSCOCO [38] dataset in the context of a standard image-text retrieval task. Sugar demonstrated comparable performance to FROMAGe [40] in R@1 and surpassed it in R@5 and R@10, highlighting Sugar's superiority in normal retrieval tasks. What's more, we then utilize FAISS [37], a powerful library for efficient similarity searches in dense vector spaces, to index and retrieve candidates. Therefore, the results may exhibit slight differences when compared under identical settings. The results in Table 3(a) are provided for reference only. (A minimal FAISS retrieval sketch is given below the table.) |
| Hardware Specification | Yes | Training is conducted on 8 x A800 GPUs for approximately 12 hours. |
| Software Dependencies | No | The paper mentions using CLIP ViT-L/14 and Vicuna 1.5 models, LoRA tuning, and the AdamW optimizer, but does not provide specific version numbers for underlying software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | In our implementation, we set the rank r = 128 and α = 256. We utilize the AdamW optimizer [60] in conjunction with a cosine learning rate scheduler. The hyperparameters for the AdamW optimizer are configured with a warm-up ratio of 0.03 and a maximum learning rate of 1e-4. (A configuration sketch for this setup is given below the table.) |
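The "Dataset Splits" row notes that FAISS [37] is used to index and retrieve candidates for the MSCOCO image-text retrieval evaluation. Below is a minimal sketch of that indexing-and-retrieval step; the embedding dimension, the choice of L2-normalized inner-product (cosine) search, and the function names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of FAISS-based candidate retrieval for an image-text
# retrieval evaluation (e.g. computing R@1/R@5/R@10 on MSCOCO).
# Embedding shapes and the cosine-similarity setup are assumptions.
import numpy as np
import faiss


def build_index(candidate_embeddings: np.ndarray) -> faiss.IndexFlatIP:
    """Index L2-normalized candidate embeddings for cosine-similarity search."""
    emb = candidate_embeddings.astype(np.float32)
    faiss.normalize_L2(emb)                     # in-place L2 normalization
    index = faiss.IndexFlatIP(emb.shape[1])     # exact inner-product search
    index.add(emb)
    return index


def retrieve(index: faiss.IndexFlatIP, query_embeddings: np.ndarray, k: int = 10):
    """Return top-k candidate ids and scores for each query embedding."""
    q = query_embeddings.astype(np.float32)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return scores, ids


if __name__ == "__main__":
    # Random placeholder embeddings; dimension 768 is an assumption.
    rng = np.random.default_rng(0)
    candidates = rng.standard_normal((5000, 768))
    queries = rng.standard_normal((100, 768))
    index = build_index(candidates)
    scores, ids = retrieve(index, queries, k=10)
    print(ids.shape)  # (100, 10): top-10 candidate indices per query
```

Because approximate or indexed search can rank candidates slightly differently than exhaustive pairwise scoring, this setup is consistent with the paper's caveat that retrieval results may differ slightly under otherwise identical settings.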
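The "Experiment Setup" row reports LoRA with rank r = 128 and α = 256, the AdamW optimizer with a cosine learning rate scheduler, a warm-up ratio of 0.03, and a peak learning rate of 1e-4. The sketch below shows one way to wire those values together with the `peft` and `transformers` libraries; the base checkpoint, target modules, dropout, and total step count are assumptions, not values reported in the paper.

```python
# Configuration sketch matching the reported hyperparameters:
# LoRA r=128, alpha=256; AdamW; cosine schedule; warm-up ratio 0.03; peak LR 1e-4.
import torch
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint (the paper uses Vicuna 1.5; the exact model id is an assumption).
base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_cfg = LoraConfig(
    r=128,                                 # rank reported in the paper
    lora_alpha=256,                        # alpha reported in the paper
    target_modules=["q_proj", "v_proj"],   # assumption: typical attention projections
    lora_dropout=0.05,                     # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

total_steps = 10_000                       # assumption: depends on data and batch size
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.03 * total_steps),  # warm-up ratio 0.03
    num_training_steps=total_steps,
)

# Inside the training loop, after each backward pass:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```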