Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Varying Shades of Wrong: Aligning LLMs with Wrong Answers Only
Authors: Jihan Yao, Wenxuan Ding, Shangbin Feng, Lucy Lu Wang, Yulia Tsvetkov
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with seven LLMs and eight datasets demonstrate that (1) LLMs do have preliminary capability in distinguishing various shades of wrong, achieving up to 20.9% higher performance than random guess; (2) Alignment with such wrong-over-wrong preferences helps LLMs to produce less wrong and sometimes even outright correct answers, while improving overall model calibration. |
| Researcher Affiliation | Collaboration | Jihan Yao¹, Wenxuan Ding², Shangbin Feng¹, Lucy Lu Wang¹,³, Yulia Tsvetkov¹; ¹University of Washington, ²The University of Texas at Austin, ³Allen Institute for AI |
| Pseudocode | Yes | Full details of wrong-over-wrong dataset construction are available in Algorithm 1. ... Algorithm 1: D_WoW generation pipeline |
| Open Source Code | Yes | Code and data are publicly available at https://github.com/yaojh18/Varying-Shades-of-Wrong. |
| Open Datasets | Yes | Knowledge Crosswords (KC) (Ding et al., 2023) is a multiple-choice structured knowledge reasoning benchmark... NLGraph (NLG) (Wang et al., 2023a) is a graph reasoning benchmark... Bio Generation (BG) LLMs are asked to generate a biography... COM2 (Fang et al., 2024) is a multiple-choice commonsense reasoning benchmark... Hellaswag (Zellers et al., 2019)... Chess Puzzle (Lichess Team, 2023)... SciBench (Wang et al., 2024b)... MedMCQA (Pal et al., 2022) |
| Dataset Splits | Yes | We sample 625, 625, 625, and 380 questions from each dataset, each split into training sets D_train, validation sets D_val, and test sets D_test with an approximately 8:1:1 ratio. ... We sample 125 questions from the official validation split and split them into validation, test sets with a 1:1 ratio. |
| Hardware Specification | No | The paper does not explicitly mention specific hardware specifications like GPU models, CPU models, or memory details used for running experiments. |
| Software Dependencies | No | We employ Unsloth and Transformers libraries for preference optimization. ... We employ three open and proprietary LLMs for experiments spanning different scales and access levels. First, we use LLAMA3-8B (Dubey et al., 2024), GPT-3.5, and GPT-4O (Achiam et al., 2023) ... MISTRAL-7B (Jiang et al., 2023), GEMINI-FLASH, GEMINI-PRO (Team et al., 2023), MISTRAL-7B (Jiang et al., 2023), GEMMA-7B (Team et al., 2024). |
| Experiment Setup | Yes | We employ a temperature of 1.0 and a max generation length of 1024. ... We conduct QLoRA fine-tuning (Dettmers et al., 2023) on LLAMA3-8B using the collected wrong-over-wrong preferences through DPO. ... We apply grid search on learning rate (1e-4, 5e-5, 1e-5), learning rate scheduler (cosine, cosine with restart and reduce lr on plateau), weight decay (0, 1e-5, 1e-3) and number of train epochs (1, 3, 5) for main experiments and right-over-wrong alignment experiments. We use random seed = 42 for all of our experiments. ... In Table 1, we use batch size = 5 for all score methods due to optimal empirical results. |
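
For readers attempting a reproduction, the reported setup can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it enumerates the hyperparameter grid quoted in the Experiment Setup row and performs an approximately 8:1:1 split of 625 question indices with the reported seed of 42. The function names, dictionary keys, scheduler labels, and the rounding rule used for the split are all assumptions made for illustration; the paper's repository should be consulted for the actual procedure.

```python
import itertools
import random

SEED = 42  # random seed reported in the Experiment Setup row

# Search space as quoted in the table. The keys and scheduler
# labels are illustrative names, not taken from the paper's code.
GRID = {
    "learning_rate": [1e-4, 5e-5, 1e-5],
    "lr_scheduler": ["cosine", "cosine_with_restarts", "reduce_lr_on_plateau"],
    "weight_decay": [0.0, 1e-5, 1e-3],
    "num_train_epochs": [1, 3, 5],
}


def grid_configs(grid):
    """Return every combination of the grid as a list of dicts."""
    keys = list(grid)
    value_lists = [grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in itertools.product(*value_lists)]


def split_8_1_1(items, seed=SEED):
    """Shuffle and split into train/val/test at roughly 8:1:1.

    Assumption: val and test each take floor(n / 10) items and the
    remainder goes to train. The paper only says "approximately 8:1:1".
    """
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = len(shuffled) // 10
    n_train = len(shuffled) - 2 * n_holdout
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_holdout]
    test = shuffled[n_train + n_holdout:]
    return train, val, test


configs = grid_configs(GRID)
train, val, test = split_8_1_1(range(625))

print(len(configs))                      # 81 combinations (3 * 3 * 3 * 3)
print(len(train), len(val), len(test))   # 501 62 62
```

Under these assumptions the full grid contains 81 configurations per experiment, which gives a sense of the compute a complete replication of the search would require given that no hardware is specified.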