Calibrated Self-Rewarding Vision Language Models
Authors: Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across ten benchmarks and tasks, achieving substantial improvements over existing methods by 7.62%. Our empirical results are further supported by rigorous theoretical analysis, under mild assumptions, verifying the effectiveness of introducing visual constraints into the self-rewarding paradigm. |
| Researcher Affiliation | Academia | Yiyang Zhou¹, Zhiyuan Fan⁵, Dongjie Cheng⁶, Sihan Yang⁷, Zhaorun Chen², Chenhang Cui⁸, Xiyao Wang³, Yun Li¹, Linjun Zhang⁴, Huaxiu Yao¹ (¹UNC-Chapel Hill, ²University of Chicago, ³University of Maryland, ⁴Rutgers University, ⁵HKUST, ⁶PolyU, ⁷NTU, ⁸NUS) |
| Pseudocode | Yes | Algorithm 1 Calibrated Self-Rewarding |
| Open Source Code | Yes | Our data and code are available at https://github.com/YiyangZhou/CSR. |
| Open Datasets | Yes | The images and prompts used to construct the preference data are randomly sampled from the detailed description and complex reasoning subclasses of the LLaVA150k dataset, totaling approximately 13,000 samples [19]. |
| Dataset Splits | No | The images and prompts used to construct the preference data are randomly sampled from the detailed description and complex reasoning subclasses of the LLaVA150k dataset, totaling approximately 13,000 samples [19]. |
| Hardware Specification | Yes | Overall, the iterative training is conducted over three iterations, completed on one A100 80GB GPU. It takes roughly 3.5 and 5 hours for fine-tuning LLaVA-1.5 7B and LLaVA-1.5 13B, respectively. |
| Software Dependencies | No | We utilize LLaVA-1.5 7B and 13B [1] as the backbone models. During the preference learning process, we adapt LoRA fine-tuning [18]. The images and prompts used to construct the preference data are randomly sampled from the detailed description and complex reasoning subclasses of the LLaVA150k dataset, totaling approximately 13,000 samples [19]. |
| Experiment Setup | Yes | The num_beams parameter, set to 5, determines the capacity of input at each search layer. Additionally, num_token_beams, also set to 5, ensures that each beam search returns 5 token-level search results. The eos_token_id is set to the token for a period, effectively controlling the sentence-by-sentence generation process. The max_length parameter, set to 1024, prevents truncation errors and infinite repetitions by controlling the maximum length, while max_new_tokens, set to 74, limits the maximum length of newly generated content to avoid exceeding the CLIP encoding limit. To further enhance data diversity, we utilize group beam search by setting the num_beam_group parameter to 5. This approach, when matched with token-level search, significantly boosts the diversity of each data point. The diversity_penalty parameter, set to a value of 3.0, effectively controls the diversity and quality of the sampled data among different beam groups. Calibrated Rewarding. We set the clip score weight to 0.9 and the language score weight to 0.1 when calculating the scores, giving greater emphasis to visual calibration. |
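The software-dependency excerpt mentions LoRA fine-tuning of the LLaVA-1.5 backbones but lists no library versions or adapter settings. Below is a minimal sketch of what such a configuration could look like with the PEFT library; the rank, alpha, dropout, and target-module choices are illustrative assumptions and are not stated in the excerpts.

```python
# Hedged sketch of a LoRA fine-tuning config for an LLaVA-1.5 backbone.
# Only the fact that LoRA is used comes from the paper excerpt; every
# hyperparameter below is an assumption for illustration.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,                      # assumed adapter rank
    lora_alpha=256,             # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

# `base_model` would be a loaded LLaVA-1.5 7B or 13B checkpoint:
# peft_model = get_peft_model(base_model, lora_config)
```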
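The experiment-setup excerpt describes sentence-by-sentence candidate generation via group beam search. A minimal sketch using Hugging Face-style `generate()` arguments follows; only the numeric values (5 beams, 5 beam groups, diversity penalty 3.0, max_length 1024, max_new_tokens 74, period as EOS) come from the excerpt, while the LLaVA-style `images` keyword and the mapping of the paper's `num_token_beams` onto `num_return_sequences` are assumptions.

```python
def sample_sentence_candidates(model, tokenizer, input_ids, image_tensor):
    """Draw 5 diverse sentence-level candidates from the current context (sketch)."""
    period_id = tokenizer.convert_tokens_to_ids(".")   # stop token: end of a sentence
    return model.generate(
        input_ids,
        images=image_tensor,        # LLaVA-style image input (assumed kwarg)
        num_beams=5,                # capacity of input at each search layer
        num_beam_groups=5,          # group beam search to boost diversity
        diversity_penalty=3.0,      # diversity/quality trade-off across groups
        num_return_sequences=5,     # 5 candidates per step (paper's num_token_beams)
        eos_token_id=period_id,     # sentence-by-sentence generation
        max_length=1024,            # guards against truncation errors and repetition
        max_new_tokens=74,          # keeps each sentence within the CLIP encoding limit
        do_sample=False,            # diverse beam search is deterministic
    )
```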
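The calibrated rewarding step weights the visual (CLIP) score at 0.9 and the language score at 0.1. The sketch below illustrates one way to combine them; the specific CLIP checkpoint, the cosine-similarity normalization, and the form of the language score are assumptions, with only the 0.9/0.1 weights taken from the excerpt.

```python
# Minimal sketch of a calibrated reward: 0.9 * clip_score + 0.1 * language_score.
# The CLIP checkpoint and score normalization are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

def calibrated_reward(image, sentence, language_score,
                      clip_weight=0.9, lang_weight=0.1):
    """Combine image-text relevance with the VLM's own sentence score."""
    inputs = processor(text=[sentence], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = clip(**inputs)
        # Cosine similarity between image and text embeddings, mapped to [0, 1].
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        clip_score = ((img * txt).sum(-1).item() + 1.0) / 2.0
    # `language_score` is assumed to be a length-normalized sentence probability
    # in [0, 1] produced by the VLM itself.
    return clip_weight * clip_score + lang_weight * language_score
```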