Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Authors: Stephen Obadinma, Xiaodan Zhu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive.
Researcher Affiliation | Academia | Stephen Obadinma, Xiaodan Zhu. Department of Electrical and Computer Engineering, Ingenuity Labs Research Institute, Queen's University.
Pseudocode | Yes | Algorithm 1: Confidence Triggers Optimization; Algorithm 2: Overview of Tested Attack Methods; Algorithm 3: Confidence Triggers Cross-Over; Algorithm 4: Confidence Triggers Mutation; Algorithm 5: Confidence Triggers Average Confidence Function; Algorithm 6: Confidence Triggers-AutoDAN; Algorithm 7: SmoothLLM-Confidence-Attack
Open Source Code | No | Code is not currently provided but can be made available on acceptance.
Open Datasets | Yes | Data. We use popular Question Answering (QA) benchmark datasets that cover the generic, medical, and legal domains: (1) MedMCQA (MQA) [51], (2) TruthfulQA (TQA) [39], and (3) StrategyQA (SQA) [18]. We also use the AdvGLUE [65] adversarial benchmark for confidence attack agnostic results, specifically the SST-2 [61] set. Details can be found in Appendix C.
Dataset Splits | Yes | For each dataset we randomly select 200 examples and test attacks against both correctly and incorrectly answered ones, reporting the average performance across the examples. For MedMCQA and StrategyQA, we use 70% of the data for optimizing the triggers and evaluate them on the remaining 30% (out of which 200 random samples are selected). Given the relatively smaller size of TruthfulQA, we train the triggers on questions with five multiple-choice options (yielding 197 different examples) and evaluate on the 203 questions with four options.
Hardware Specification | No | The Llama models are run locally while the GPT models are run through the OpenAI API.
Software Dependencies | No | The paper mentions using Universal Sentence Encoder (USE), WordNet, GloVe embeddings, and NLTK but does not provide specific version numbers for these software components.
Experiment Setup | Yes | In terms of sampling parameters for each model, we utilize a temperature of 0.0 for the Base, CoT, and MS methods, along with a top_p of 0.95. For SC and SC-CA, we set the temperature to 0.7, sample three different responses, and average the confidence scores. We limit the number of generated tokens to 400. For the MS method, we calculate the geometric mean of the generated confidence scores for each step, including the final one. For our trigger prompt length β we choose 20 tokens, and for the total number of prompts we attempt to optimize, α, we also choose 20. We experimented with optimizing between 20 and 50 total generations. For the number of samples for the initial estimation, S, we choose 200, and for the value ξ in each iteration we choose 12, leading to 48 different sampled training cases used for estimation for each prompt in each generation. We choose a fifth of the prompts (i.e., 4) with the lowest loss to be elites.
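
The aggregation and selection rules quoted in the Experiment Setup row can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code (which is not released): the function names, the assumption that confidence scores are expressed in the range 0 to 1, and the handling of a zero score are all hypothetical choices made for the example. Only the numeric settings (three SC samples, 20 prompts, a fifth of them kept as elites) come from the quoted text.

```python
import math
from statistics import mean

# Settings quoted in the Experiment Setup row.
NUM_SC_SAMPLES = 3   # responses sampled for SC and SC-CA
NUM_PROMPTS = 20     # total prompts being optimized (alpha)
ELITE_DIVISOR = 5    # "a fifth of the prompts" are kept as elites


def ms_confidence(step_confidences):
    """Geometric mean of per-step confidences, final step included.

    Assumption: each score lies in [0, 1]. A zero score makes the
    geometric mean zero, so it is returned directly to avoid log(0).
    """
    if not step_confidences:
        raise ValueError("at least one confidence score is required")
    if any(c <= 0.0 for c in step_confidences):
        return 0.0
    log_sum = sum(math.log(c) for c in step_confidences)
    return math.exp(log_sum / len(step_confidences))


def sc_confidence(sample_confidences):
    """Arithmetic mean of the confidences from the sampled responses."""
    return mean(sample_confidences)


def select_elites(prompts, losses, elite_divisor=ELITE_DIVISOR):
    """Return the prompts with the lowest loss (one fifth by default)."""
    if len(prompts) != len(losses):
        raise ValueError("prompts and losses must have the same length")
    n_elites = max(1, len(prompts) // elite_divisor)
    # Sort indices by loss so ties keep their original order.
    order = sorted(range(len(prompts)), key=lambda i: losses[i])
    return [prompts[i] for i in order[:n_elites]]
```

With 20 candidate prompts, `select_elites` keeps 4, matching the "fifth of prompts (i.e., 4)" in the quoted setup. The geometric mean penalizes a single low-confidence step more strongly than an arithmetic mean would, which may be why it is used for the multi-step (MS) method; the source does not state the rationale.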