Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Preference Learning with Lie Detectors can Induce Honesty or Evasion

Authors: Chris Cundy, Adam Gleave

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using DolusChat, a novel 65k-example dataset with paired truthful/deceptive responses, we identify three key factors that determine the honesty of learned policies: amount of exploration during preference learning, lie detector accuracy, and KL regularization strength. We find that preference learning with lie detectors and GRPO can lead to policies which evade lie detectors, with deception rates of over 85%. However, if the lie detector true positive rate (TPR) or KL regularization is sufficiently high, GRPO learns honest policies.
Researcher Affiliation | Industry | Chris Cundy, FAR.AI; Adam Gleave, FAR.AI
Pseudocode | No | The paper describes the SOLiD approach and training procedures in text and uses flowcharts (Figure 1, Figure 23) to illustrate the process, but no structured pseudocode or algorithm blocks are provided.
Open Source Code | Yes | DolusChat is available at https://huggingface.co/datasets/alignmentresearch/DolusChat and the code is available at https://github.com/AlignmentResearch/deception-evasion-honesty.
Open Datasets | Yes | We constructed DolusChat, a 65k-example chat format dataset. ... DolusChat is available at https://huggingface.co/datasets/alignmentresearch/DolusChat ... In Appendix C.11 we evaluate our results on the MASK dataset [26]
Dataset Splits | Yes | Table 1: Splits of the DolusChat dataset used in this work. The evaluation, detector training, and preference learning splits are separate and used solely for the purpose given. Total: 65,000 (100%), paired truthful/deceptive responses. Evaluation: 3,250 (5%), final evaluation of policy; paired truthful/deceptive responses. Detector Training (Train): 2,925 (4.5%), training deception detector with cross-validation. Detector Training (Validation): 325 (0.5%), choosing detector decision boundary to achieve specified TPR or FPR. Preference Learning: 58,500 (90%), training reward model, supervised fine-tuning, and DPO/GRPO. ... We generated 21048 training examples; 5261 examples to train the logistic regression, and 1384 for testing.
Hardware Specification | Yes | For all experiments, we used two H100 GPUs with data parallel training. ... The final results presented here cover about 100 experiments, comprising about 2,800 H100 hours. Development of the techniques, algorithms, etc., required about 15,000 H100 hours.
Software Dependencies | No | The paper mentions software like transformers [37], TRL library [33], and scikit-learn (via Pedregosa et al. [24]), but does not provide specific version numbers for these software components, which is required for a reproducible description of ancillary software.
Experiment Setup | Yes | B.1 Default Hyperparameters: We train our models using the transformers [37] and TRL library [33], with the default hyperparameters unless otherwise specified. The LoRA α is always set to twice the LoRA rank. B.2 Detector: We train the detector using the elasticnet [39] logistic regression implementation from Pedregosa et al. [24]... For the SFT model, we use LoRA [14] with a rank of 512. We use a batch size of 128, weight decay of 0.01, a cosine learning rate with linear warmup factor of 0.2, a NEFTune α [15] of 5. We use the AdamW [17] optimizer with learning rate 10^-5, and parameters β1 = 0.9, β2 = 0.95. We train for one epoch. For DPO, we use LoRA [14] with a rank of 512. We use a batch size of 256, weight decay of 10^-4, and a cosine learning rate with linear warmup factor of 0.2. We use the AdamW [17] optimizer with learning rate 10^-5, and parameters β1 = 0.95, β2 = 0.98. We use the CPO formulation [18] with label-smoothing factor 0.05 and RPO [21] with αRPO = 0.2. We train for two epochs. B.5 Reward Model: For the reward model, we use LoRA [14] with a rank of 256. We use a batch size of 256, weight decay of 10^-2, and a cosine learning rate with linear warmup factor of 0.1. We use the AdamW [17] optimizer with learning rate 5 × 10^-6, and parameters β1 = 0.95, β2 = 0.98. We train for four epochs. For the GRPO model, we use LoRA [14] with a rank of 512... We use a batch size of 512, weight decay of 10^-3, and a cosine learning rate with linear warmup factor of 0.1. We use the AdamW [17] optimizer with learning rate 5 × 10^-6, and parameters β1 = 0.95, β2 = 0.98. We train for 150,000 total episodes.
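
As a reading aid, the split sizes and per-stage hyperparameters quoted in the table can be collected into a small Python sketch. This is not the authors' code: the names StageConfig, STAGES, SPLITS, and split_fractions are hypothetical, and the learning rates and weight decays are read as negative powers of ten (for example 1e-5), which is an assumption about the source's notation. Only the fields reported for all four stages are included; stage-specific settings such as NEFTune α, label smoothing, and training length are left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (hypothetical container)."""
    lora_rank: int
    batch_size: int
    weight_decay: float
    warmup_factor: float
    learning_rate: float
    adam_betas: Tuple[float, float]

    @property
    def lora_alpha(self) -> int:
        # The summary states that LoRA alpha is always twice the LoRA rank.
        return 2 * self.lora_rank


# Values transcribed from the Experiment Setup row.
# Assumption: exponents are negative (e.g. learning rate 1e-5).
STAGES: Dict[str, StageConfig] = {
    "sft": StageConfig(512, 128, 0.01, 0.2, 1e-5, (0.9, 0.95)),
    "dpo": StageConfig(512, 256, 1e-4, 0.2, 1e-5, (0.95, 0.98)),
    "reward_model": StageConfig(256, 256, 1e-2, 0.1, 5e-6, (0.95, 0.98)),
    "grpo": StageConfig(512, 512, 1e-3, 0.1, 5e-6, (0.95, 0.98)),
}

# Split sizes transcribed from the Dataset Splits row.
TOTAL_EXAMPLES = 65_000
SPLITS: Dict[str, int] = {
    "evaluation": 3_250,
    "detector_train": 2_925,
    "detector_validation": 325,
    "preference_learning": 58_500,
}


def split_fractions(splits: Dict[str, int], total: int) -> Dict[str, float]:
    """Return each split's share of the full dataset."""
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    # The four splits should account for every example exactly once.
    assert sum(SPLITS.values()) == TOTAL_EXAMPLES
    for name, share in split_fractions(SPLITS, TOTAL_EXAMPLES).items():
        print(f"{name}: {share:.1%}")
    for name, cfg in STAGES.items():
        print(f"{name}: rank={cfg.lora_rank}, alpha={cfg.lora_alpha}")
```

Under these assumptions, STAGES["grpo"].lora_alpha evaluates to 1024, and the four split sizes sum to the reported total of 65,000, matching the 5%, 4.5%, 0.5%, and 90% shares in the table.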