Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Authors: Runqi Lin, Bo Han, Fengwang Li, Tongliang Liu

ICLR 2025

Reproducibility variables, each with the classifier's result and the supporting LLM response:
Research Type: Experimental. "Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs."
Researcher Affiliation: Academia. Runqi Lin, Sydney AI Centre, The University of Sydney; Bo Han, Hong Kong Baptist University; Fengwang Li, The University of Sydney; Tongliang Liu, Sydney AI Centre, The University of Sydney.
Pseudocode: Yes. "The three-stage PiF algorithm is summarised in Algorithm 1." (Appendix A, Algorithm 1: Perceived-importance Flatten Method)
Open Source Code: Yes. "Our implementation can be found at https://github.com/tmllab/2025_ICLR_PiF."
Open Datasets: Yes. "We evaluate our approach on two benchmark datasets: AdvBench (Zou et al., 2023) and MaliciousInstruct (Huang et al., 2023a), which contain 520 and 100 malicious inputs, respectively." Table 7 lists the dataset links: AdvBench at https://github.com/llm-attacks/llm-attacks/tree/main/data/advbench and MaliciousInstruct at https://github.com/Princeton-SysML/Jailbreak_LLM/blob/main/data.
Dataset Splits: No. The paper states that AdvBench and MaliciousInstruct "contain 520 and 100 malicious inputs, respectively," but does not specify how these datasets are split into training, validation, or test sets for the authors' experiments.
Hardware Specification: No. The paper does not describe the hardware used for its experiments, such as GPU models (e.g., NVIDIA A100) or CPU models. The acknowledgments mention the National Computational Infrastructure (NCI Australia), but no specific hardware details are given.
Software Dependencies: No. The paper mentions models such as BERT-Large, Llama-2, and GPT-2-Large, and components such as a perplexity filter that likely rely on libraries like Hugging Face Transformers, but it provides no version numbers for these software components or for the programming language (e.g., Python).
Experiment Setup: Yes. "Setup for PiF. We employ BERT-Large (Devlin et al., 2019) as the source model with the evaluation template 'This intent is [MASK].' The hyperparameters are configured as follows: the number of iterations T is set to 50; the temperature τ is set to 0.25; the threshold Θ is set to 0.85; and the values of N, M, and K are all set to 15."
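The reported hyperparameters can be captured in a small sketch. The configuration dict below mirrors the values quoted above, and the `perceived_importance` helper is a hypothetical illustration of a temperature-scaled softmax (with τ = 0.25, a low temperature that sharpens the distribution over tokens); neither the dict keys nor the function are taken from the PiF release.

```python
import math

# Hyperparameter values quoted from the paper's setup; key names are our own.
PIF_CONFIG = {
    "iterations_T": 50,       # number of iterations T
    "temperature_tau": 0.25,  # softmax temperature τ
    "threshold_theta": 0.85,  # stopping threshold Θ
    "N": 15, "M": 15, "K": 15,
}

def perceived_importance(logits, tau):
    """Temperature-scaled softmax over per-token logits.

    A low tau (e.g. 0.25) sharpens the distribution, concentrating
    perceived importance on the highest-scoring tokens.
    """
    scaled = [x / tau for x in logits]
    m = max(scaled)                           # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, for logits `[2.0, 1.0]` the τ = 0.25 distribution assigns noticeably more mass to the first token than a plain (τ = 1.0) softmax would.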
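Since Algorithm 1 itself is not reproduced here, the following is only an illustrative sketch of the kind of iterative importance-flattening loop the summary describes: while some token's perceived importance exceeds the threshold Θ = 0.85, replace the most salient token with the candidate that most reduces the maximum importance. The function name, signature, and the caller-supplied `score_fn`/`candidates_fn` stubs are all our assumptions, not the authors' implementation.

```python
def pif_flatten(tokens, score_fn, candidates_fn, iterations=50, threshold=0.85):
    """Hypothetical importance-flattening loop (not Algorithm 1 verbatim).

    score_fn(tokens)        -> list of per-token importance scores
    candidates_fn(tokens,i) -> replacement candidates for position i
    """
    tokens = list(tokens)
    for _ in range(iterations):
        scores = score_fn(tokens)
        idx = max(range(len(tokens)), key=lambda i: scores[i])
        if scores[idx] < threshold:
            break  # importance is already flat enough; stop early
        best_token, best_max = tokens[idx], scores[idx]
        for cand in candidates_fn(tokens, idx):
            trial = tokens[:idx] + [cand] + tokens[idx + 1:]
            trial_max = max(score_fn(trial))
            if trial_max < best_max:  # keep the candidate that flattens most
                best_token, best_max = cand, trial_max
        tokens[idx] = best_token
    return tokens
```

In the paper's setup, `score_fn` would presumably be derived from the BERT-Large source model and its evaluation template; a toy scorer is enough to exercise the loop.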