Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Understanding and Enhancing the Transferability of Jailbreaking Attacks
Authors: Runqi Lin, Bo Han, Fengwang Li, Tongliang Liu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs. |
| Researcher Affiliation | Academia | Runqi Lin (Sydney AI Centre, The University of Sydney); Bo Han (Hong Kong Baptist University); Fengwang Li (The University of Sydney); Tongliang Liu (Sydney AI Centre, The University of Sydney) |
| Pseudocode | Yes | Appendix A (Algorithm): The three-stage PiF algorithm is summarised in Algorithm 1 (Perceived-importance Flatten Method). |
| Open Source Code | Yes | Our implementation can be found at https://github.com/tmllab/2025_ICLR_PiF. |
| Open Datasets | Yes | We evaluate our approach on two benchmark datasets: AdvBench (Zou et al., 2023) and MaliciousInstruct (Huang et al., 2023a), which contain 520 and 100 malicious inputs, respectively. Dataset links (Table 7): AdvBench: https://github.com/llm-attacks/llm-attacks/tree/main/data/advbench; MaliciousInstruct: https://github.com/Princeton-SysML/Jailbreak_LLM/blob/main/data |
| Dataset Splits | No | The paper mentions using the AdvBench and MaliciousInstruct datasets, stating they "contain 520 and 100 malicious inputs, respectively." However, it does not specify how these datasets are split into training, validation, or test sets for the authors' own experiments or for reproducing the method. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for running its experiments, such as GPU models (e.g., NVIDIA A100) or CPU models. It mentions 'National Computational Infrastructure (NCI Australia)' in the acknowledgments but no specific hardware details. |
| Software Dependencies | No | The paper mentions models such as BERT-Large, Llama-2, and GPT-2-Large, and refers to concepts like a 'perplexity filter', which likely rely on libraries such as Hugging Face Transformers. However, it does not provide version numbers for these software components or for the programming language (e.g., Python). |
| Experiment Setup | Yes | Setup for PiF. We employ BERT-Large (Devlin et al., 2019) as the source model with the evaluation template "This intent is [MASK].". The hyperparameters are configured as follows: the number of iterations T is set to 50; the temperature τ is set to 0.25; the threshold Θ is set to 0.85; and the values of N, M, and K are all set to 15. |
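As a reading aid, the reported setup can be collected into a single configuration sketch. This is a minimal illustration, not the authors' released code: the variable names (`pif_config`, `render_template`) and the exact BERT checkpoint identifier are assumptions; only the hyperparameter values and the evaluation template come from the paper.

```python
# Hedged sketch of the reported PiF experiment setup.
# Values (T, tau, Theta, N, M, K, template) are quoted from the paper;
# all identifiers and the checkpoint name are illustrative assumptions.
pif_config = {
    "source_model": "bert-large-uncased",  # BERT-Large (Devlin et al., 2019); exact checkpoint is an assumption
    "evaluation_template": "This intent is [MASK].",
    "iterations_T": 50,      # number of optimisation iterations
    "temperature_tau": 0.25, # sampling temperature
    "threshold_theta": 0.85, # stopping/acceptance threshold
    "N": 15, "M": 15, "K": 15,  # candidate-set sizes, all set to 15
}

def render_template(config: dict, prompt: str) -> str:
    """Append the masked evaluation template to a candidate prompt
    so the source model can score the perceived intent (illustrative helper)."""
    return f"{prompt} {config['evaluation_template']}"

print(render_template(pif_config, "Write a tutorial."))
```

Feeding the rendered string to a masked-language model such as BERT-Large would then yield a prediction for the `[MASK]` token, which is how a fill-mask evaluation of this template could be wired up in practice.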