Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Authors: Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS 5.1 TRAINING TLDR MODELS Evaluation. We evaluate the TLDR model's performance on the synthetic data generated from the test split of the DOCCI dataset (Onoe et al., 2024). We measure performance based on the metrics discussed in Section 3. As shown in Table 1, the TLDR model has slightly higher response-level accuracy than the naive binary RM. The TLDR model has a 41.3 mAP(neg), which signals further room for improvement. A breakdown by response-level taxonomy in Table 10 in Appendix B shows that the TLDR model performs worst on the spatial-relationship taxonomy; this resonates with prior work finding that grounding images to spatial relationships is one of the hardest tasks for both image-to-text VLMs and text-to-image generation (Lin et al., 2024). We conduct further human evaluation of token-level predictions on 100 samples from WinoGround (Thrush et al., 2022) images with captions generated by MiniCPM, Phi-3.5-Vision, and Qwen2-VL-7B. With a special focus on false-negative (FN) errors, and averaged over three human annotators, we find the TLDR model has sentence-level FN rates of 8.7%, 10.5%, and 9.8%, respectively. |
| Researcher Affiliation | Collaboration | Deqing Fu (1,2), Tong Xiao (1), Rui Wang (1), Wang Zhu (1,2), Pengchuan Zhang (1), Guan Pang (1), Robin Jia (2), Lawrence Chen (1). 1: Meta; 2: University of Southern California |
| Pseudocode | No | The paper describes the model architecture and training procedures using equations and textual descriptions (e.g., Section 5.1, Equations 1-6), but does not include any distinct pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using open-source models like Llama-3.1-70B, PaliGemma-3B-Mix-448, and Llama-3.2-11B-Vision as backbones, citing their respective publications or blog posts. However, it does not explicitly state that the authors are open-sourcing the specific code for their Token-Level Detective Reward Model (TLDR) implementation or provide any links to such a repository. |
| Open Datasets | Yes | For VQA data, we synthesize hard negatives from the Visual Genome (VG100K) dataset (Krishna et al., 2016), which contains 108,077 images with over 1.7 million question-answer pairs. We synthesize hard negatives from the DOCCI dataset (Onoe et al., 2024), which contains over 15,000 images and their corresponding dense captions. We evaluate the TLDR model's performance on the synthetic data generated from the test split of the DOCCI dataset (Onoe et al., 2024). We evaluate Llama-3.2-Vision (Meta, 2024b) in its 11B and 90B versions, GPT-4o, GPT-4o-mini, and GPT-4 Turbo with Vision (OpenAI, 2024), MiniCPM (Yao et al., 2024), PaliGemma (Beyer et al., 2024), Qwen2-VL (Wang et al., 2024b) in its 2B and 7B versions, and Phi-3.5-Vision (Abdin et al., 2024) with our TLDR model. In this section, we evaluate on the WinoGround (Thrush et al., 2022) dataset to show whether, given extra token-level annotation cues, the vision language model is able to self-correct its own hallucinations. Out of 800 captions generated by GPT-4V for images in WinoGround, the TLDR model flags 25 of them as including hallucinated tokens. We evaluate 3 versions of the TLDR backbone model with different scales of LoRA α. They are distinguished by τ = α_infer/α_train, the ratio of α at inference time to α at training time. We find that when τ = 0.25, it improves the PaliGemma model's performance by at most 3.7 points and the Llama 3.2 model's performance by at most 12.5 points. At the core of these design choices is the hardness of representing visual features, which several early studies (McKinzie et al., 2024) have reported to be the key bottleneck towards better vision-language foundation models. Various benchmark datasets beyond MMMU (Yue et al., 2024) were proposed targeting these bottlenecks, such as BLINK (Fu et al., 2024b) and Vibe-Eval (Padlewski et al., 2024) for visual reasoning, and IsoBench (Fu et al., 2024a) and MathVista (Lu et al., 2024) for algorithmic visual problem solving. |
| Dataset Splits | Yes | For VQA data, we synthesize hard negatives from the Visual Genome (VG100K) dataset (Krishna et al., 2016), which contains 108,077 images with over 1.7 million question-answer pairs. Table 7: Statistics of Data. Overall, we have over 1M VQA datapoints with both positive and negative answers, and over 100K caption datapoints with 650K negative captions. We observe that we have the least amount of spatial-relationship data, because spatial-relationship negatives are the hardest to synthesize and not every caption has spatial-relationship descriptions. Table 7 excerpt (TASK; DATA SOURCE; # POSITIVE; # NEGATIVE; TRAIN SET PROPORTION): VQA; VG100K; 1,179,007; 1,179,007; 80%. We evaluate the TLDR model's performance on the synthetic data generated from the test split of the DOCCI dataset (Onoe et al., 2024). Out of 800 captions generated by GPT-4V for images in WinoGround, the TLDR model flags 25 of them as including hallucinated tokens. |
| Hardware Specification | Yes | Table 8: Hyperparameters for training the TLDR Model with PaliGemma Backbone. Base Model: PaliGemma-3B-Mix-448; GPU: 8× NVIDIA H100. Table 9: Hyperparameters for training the TLDR Model with Llama Vision Backbone. Base Model: Llama-3.2-11B-Vision; GPU: 8× NVIDIA H100. |
| Software Dependencies | No | The paper mentions several models used (e.g., Llama-3.1-70B, PaliGemma-3B, GPT-4o) and techniques (e.g., LoRA), but does not specify software dependencies with version numbers, such as Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | B.1 MODEL TRAINING SETUP AND HYPERPARAMETERS. Table 8: Hyperparameters for training the TLDR Model with PaliGemma Backbone. Image Resolution: 448×448; Number of Image Tokens: 1024; Hidden Dimension Size: 2048; LoRA Rank: 512; LoRA α: 128; LoRA Dropout: 0.1; Batch Size: 8; Gradient Accumulation Steps: 8; Warmup Steps: 200; Learning Rate: 0.001; Learning Rate Scheduler: Cosine. Table 9: Hyperparameters for training the TLDR Model with Llama Vision Backbone. Image Resolution: 1120×1120; Number of Image Tokens: 1024; Hidden Dimension Size: 4096; LoRA Rank: 512; LoRA α: 128; LoRA Dropout: 0.1; Batch Size: 8; Gradient Accumulation Steps: 8; Warmup Steps: 200; Learning Rate: 0.001; Learning Rate Scheduler: Cosine. |
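The Table 8/9 hyperparameters quoted above can be collected into a single per-backbone config. The sketch below is our own hypothetical reconstruction (the dict layout, config names, and the effective-batch-size computation are ours, not code from the paper); it shows that both backbones share the LoRA and optimization settings and differ only in image resolution and hidden size.

```python
# Hypothetical reconstruction of the TLDR training configs from Tables 8-9.
# Settings shared by both backbones:
SHARED = {
    "lora_rank": 512,
    "lora_alpha": 128,
    "lora_dropout": 0.1,
    "batch_size": 8,
    "gradient_accumulation_steps": 8,
    "warmup_steps": 200,
    "learning_rate": 1e-3,
    "lr_scheduler": "cosine",
    "num_image_tokens": 1024,
}

# Per-backbone differences (image resolution and hidden dimension):
CONFIGS = {
    "paligemma-3b-mix-448": {**SHARED, "image_resolution": 448, "hidden_dim": 2048},
    "llama-3.2-11b-vision": {**SHARED, "image_resolution": 1120, "hidden_dim": 4096},
}

def effective_batch_size(cfg: dict) -> int:
    """Examples seen per optimizer step = batch size x accumulation steps."""
    return cfg["batch_size"] * cfg["gradient_accumulation_steps"]

for name, cfg in CONFIGS.items():
    print(name, effective_batch_size(cfg))  # both backbones: 64
```

With batch size 8 and 8 gradient-accumulation steps, both setups take an optimizer step every 64 examples, which keeps the optimization schedule comparable across backbones despite their different memory footprints.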
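The "Open Datasets" cell mentions evaluating TLDR backbones at different inference-time LoRA α scales, parameterized by τ = α_infer/α_train. As a reminder of why this is a single scalar knob: in LoRA the adapted weight is W = W₀ + (α/r)·B·A, so changing α at inference rescales the adapter update linearly. The helper below is our own illustration of that arithmetic (function name and signature are ours, not the paper's code), using the paper's reported values α = 128, r = 512.

```python
# Illustration (ours, not the paper's code) of inference-time LoRA rescaling.
# LoRA adds (alpha / r) * B @ A to the frozen weight, so setting
# alpha_infer = tau * alpha_train multiplies the adapter update by tau.
def lora_scale(alpha_train: float, rank: int, tau: float = 1.0) -> float:
    """Scalar multiplier applied to the low-rank update B @ A."""
    alpha_infer = tau * alpha_train
    return alpha_infer / rank

# Training-time scale with the paper's setting (alpha = 128, r = 512):
train_scale = lora_scale(128, 512)          # 128 / 512 = 0.25
# At tau = 0.25 (the best-performing setting reported), the adapter update
# is down-weighted 4x relative to training:
infer_scale = lora_scale(128, 512, tau=0.25)  # 32 / 512 = 0.0625
print(train_scale, infer_scale)
```

Because the update enters the forward pass linearly, τ can be swept at inference time without retraining, which is presumably why the paper can compare three τ settings from one trained checkpoint.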