Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Authors: Csaba Dékány, Stefan Balauca, Dimitar I. Dimitrov, Robin Staab, Martin Vechev
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation... |
| Researcher Affiliation | Academia | INSAIT, Sofia University "St. Kliment Ohridski"; ETH Zurich; ELTE Eötvös Loránd University, Budapest, Hungary |
| Pseudocode | No | The paper describes the methodology in prose and uses conceptual diagrams (e.g., Figure 1, Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide our code and models at https://github.com/insait-institute/MixAT. |
| Open Datasets | Yes | We train our models using the same harmful requests as Xhonneux et al. [7], which are similar to the ones in the HarmBench dataset [5]. The safe answer used during adversarial training is always Sorry, I can't do that. To maintain utility, we use the UltraChat200k [45] utility dataset, used in both adversarial training [7] and general fine-tuning [27, 29, 46]. ... In Table 21 we list the licenses of the different datasets used in this paper. |
| Dataset Splits | Yes | we restrict our evaluation to the first 40 non-copyright-related samples in the HarmBench test set (details in Appendix B.4). ... The dataset is sampled with the interleave_dataset function from the Hugging Face datasets library, with probabilities 0.875 (benign) and 0.125 (harmful), and with strategy first_exhausted. So the ratio of benign and malicious queries is around 7:1. With our fixed seed, the benign part consists of 8476 samples sampled from UltraChat200k. |
| Hardware Specification | Yes | As shown there, we train all of the models using either NVIDIA A100-40GB or NVIDIA H200 GPUs. |
| Software Dependencies | No | The paper mentions using LoRA adapters [37] and the AdamW optimizer [47], but does not provide specific version numbers for software libraries or programming languages used in the implementation. |
| Experiment Setup | Yes | By default, models are 4-bit quantized and aligned using LoRA adapters [37]... We use 10-step L2-bounded continuous adversarial attacks with ϵ = 0.075, and discrete PAP attacks. The default PAP sample ratio is α = 0.5... We train for 2 epochs (in contrast to 5 in CAT) with a batch size 64, a learning rate of 2e-4, the AdamW optimizer [47], and a cosine learning rate scheduler. |
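
The Research Type row mentions the At Least One Attack Success Rate (ALO-ASR) metric without spelling out how it is computed. The sketch below is one reading of the metric based on its name: a prompt counts as broken if at least one attack succeeds against it, and ALO-ASR is the fraction of broken prompts. The function name `alo_asr`, the input layout, and the attack labels are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List


def alo_asr(success: Dict[str, List[bool]]) -> float:
    """Fraction of prompts for which at least one attack succeeded.

    `success` maps an attack name to per-prompt outcomes, where
    success[attack][i] is True if that attack broke prompt i.
    This layout is an assumption made for illustration.
    """
    per_attack = list(success.values())
    if not per_attack:
        raise ValueError("need results for at least one attack")
    n_prompts = len(per_attack[0])
    if n_prompts == 0:
        raise ValueError("need at least one prompt")
    if any(len(outcomes) != n_prompts for outcomes in per_attack):
        raise ValueError("every attack must cover the same prompts")
    # A prompt is broken if any attack succeeded on it.
    broken = [any(column) for column in zip(*per_attack)]
    return sum(broken) / n_prompts


# Hypothetical outcomes for two attacks on four prompts.
results = {
    "attack_a": [True, False, False, False],
    "attack_b": [False, False, True, False],
}
print(alo_asr(results))  # each attack alone succeeds on 1 of 4 prompts; together they break 2 of 4
```

Under this reading, ALO-ASR is never lower than the success rate of any single attack, which is why it serves as a worst-case measure.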
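
The Dataset Splits row states that sampling with probabilities 0.875 (benign) and 0.125 (harmful) yields a benign-to-harmful ratio of about 7:1, since 0.875 / 0.125 = 7. The sketch below illustrates that arithmetic with a standard-library stand-in for probabilistic interleaving that stops once one source is used up. It is not the Hugging Face implementation (the library function is `interleave_datasets`), and the function name, source sizes, and seed here are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def interleave_first_exhausted(
    benign: Sequence[int],
    harmful: Sequence[int],
    p_benign: float = 0.875,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Draw from two sources at random until one of them is used up."""
    rng = random.Random(seed)
    mixed: List[Tuple[str, int]] = []
    i = j = 0
    while i < len(benign) and j < len(harmful):
        # Pick the benign source with probability p_benign, else the harmful one.
        if rng.random() < p_benign:
            mixed.append(("benign", benign[i]))
            i += 1
        else:
            mixed.append(("harmful", harmful[j]))
            j += 1
    return mixed


# Hypothetical source sizes, chosen so the harmful source runs out first.
mixed = interleave_first_exhausted(list(range(100_000)), list(range(1_000)))
n_benign = sum(1 for source, _ in mixed if source == "benign")
n_harmful = len(mixed) - n_benign
print(n_benign / n_harmful)  # expected to land near 0.875 / 0.125 = 7
```

Because the draw is random, the realized ratio only approximates 7:1, which matches the row's wording of "around 7:1" for a fixed seed.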
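
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The dataclass below is only a sketch: the values come from the row above, while the class name and every field name are hypothetical and do not reflect the configuration format of the authors' repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixATTrainingConfig:
    """Defaults quoted in the Experiment Setup row; field names are hypothetical."""

    quantization_bits: int = 4          # models are 4-bit quantized
    use_lora: bool = True               # aligned using LoRA adapters
    continuous_attack_steps: int = 10   # 10-step continuous attack
    continuous_attack_norm: str = "l2"  # L2-bounded perturbation
    epsilon: float = 0.075              # perturbation bound
    discrete_attack: str = "PAP"        # discrete attack family
    pap_sample_ratio: float = 0.5       # alpha, the default PAP sample ratio
    epochs: int = 2                     # versus 5 in CAT
    batch_size: int = 64
    learning_rate: float = 2e-4
    optimizer: str = "adamw"
    lr_scheduler: str = "cosine"


config = MixATTrainingConfig()
print(config)
```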