Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Semantic Representation Attack against Aligned Large Language Models

Authors: Jiawei Lian, Jianhong Pan, Lefan Wang, Yi Wang, Shaohui Mei, Lap-Pui Chau

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments demonstrate that our method significantly outperforms existing approaches, achieving an unprecedented 100% attack success rate (ASR) on 11 out of 18 LLMs, as shown in Fig. 1. Our experiments utilize two benchmark datasets: HarmBench [34], which provides standardized evaluation of adversarial attacks against aligned LLMs across diverse harmful scenarios, and AdvBench [68]. We evaluate attack performance using attack success rate (ASR), same as previous works [68, 31, 41]. The ASR is calculated as ASR = n_s / n_t, where n_s is the number of successful attacks and n_t is the total number of data points. For the stealthiness evaluation, we calculate the PPL scores of adversarial prompts using target LLMs, where a lower perplexity score indicates a more semantically coherent adversarial prompt. The perplexity score is calculated as Eq. 3.
Researcher Affiliation | Academia | ¹Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR; ²School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Pseudocode | Yes | Algorithm 1: Semantic Representation Heuristic Search (SRHS). Input: malicious user query token sequence q and semantic representation Φ of corresponding malicious responses, template token sequences s1 and s2, adversarial threshold τ, vocabulary V, semantic representation mapping function R. Output: adversarial prompt set A.
Open Source Code | Yes | The code is available at https://github.com/JiaweiLian/SRA.git.
Open Datasets | Yes | Our experiments utilize two benchmark datasets: HarmBench [34], which provides standardized evaluation of adversarial attacks against aligned LLMs across diverse harmful scenarios, and AdvBench [68], which ensures the fairness of efficiency comparison with prior works by following the evaluation protocol established by the efficient attack framework BEAST [41].
Dataset Splits | Yes | Our experiments utilize two benchmark datasets: HarmBench [34], which provides standardized evaluation of adversarial attacks against aligned LLMs across diverse harmful scenarios, and AdvBench [68], which ensures the fairness of efficiency comparison with prior works by following the evaluation protocol established by the efficient attack framework BEAST [41].
Hardware Specification | Yes | All time-budgeted experiments run on a single NVIDIA RTX 6000 Ada GPU with 48GB of memory.
Software Dependencies | No | The paper mentions the specific LLMs used (e.g., Llama 2 13B, Llama 3.1 8B) but does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA that would be necessary to replicate the experiments.
Experiment Setup | Yes | We set the response length to 512 tokens and configure the batch size with a default of 256, automatically adjusting based on available GPU memory. For the coherence threshold τ, we use an adaptive approach in effectiveness experiments for optimal performance (see B.2) and set it to 20 for fair efficiency comparison. To account for the platykurtic distribution characteristic of some LLMs, we constrain the candidate tokens of each node through top-k sampling with k = 50.
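
The two evaluation metrics and the top-k candidate restriction quoted above can be illustrated with a minimal Python sketch. It is not taken from the paper or its repository. The function names are hypothetical. The perplexity function uses the standard definition (exponential of the negative mean token log-probability), which is assumed, not confirmed, to match the paper's Eq. 3, and the top-k helper only selects the k most probable token indices without any sampling or search logic.

```python
import math
from typing import List, Sequence


def attack_success_rate(n_success: int, n_total: int) -> float:
    """ASR = n_s / n_t: successful attacks over total data points."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_success / n_total


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Standard perplexity from per-token natural-log probabilities.

    Assumption: this mirrors the paper's Eq. 3, which is not reproduced
    in the summary above.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def top_k_candidates(probs: Sequence[float], k: int = 50) -> List[int]:
    """Indices of the k most probable tokens, most probable first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


# Toy values for illustration only; none of these are results from the paper.
asr = attack_success_rate(3, 4)              # 0.75
ppl = perplexity([math.log(0.5)] * 4)        # approximately 2.0
cands = top_k_candidates([0.1, 0.5, 0.4], 2) # [1, 2]
```

In practice the log-probabilities would come from the target LLM scoring the adversarial prompt, and the probability vector would be its next-token distribution; both are replaced by toy values here.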