BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models

Authors: Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we show the effectiveness of Bad Chain for two COT strategies across four LLMs (Llama2, GPT-3.5, Pa LM2, and GPT-4) and six complex benchmark tasks encompassing arithmetic, commonsense, and symbolic reasoning. We conduct extensive empirical evaluations for Bad Chain under different settings.
Researcher Affiliation Academia Zhen Xiang1, Fengqing Jiang2, Zidi Xiong1, Bhaskar Ramasubramanian3, Radha Poovendran2, Bo Li1 1University of Illinois Urbana-Champaign 2University of Washington 3Western Washington University
Pseudocode No The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code Yes Code related to this work is available at https://github.com/Django-Jiang/Bad Chain.
Open Datasets Yes Datasets: Following prior works on COT like (Wei et al., 2022; Wang et al., 2023b), we consider six benchmark datasets encompassing three categories of challenging reasoning tasks. For arithmetic reasoning, we consider three datasets on math word problems, including GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and ASDiv (Miao et al., 2020). For commonsense reasoning, we consider CSQA for multiple-choice questions (Talmor et al., 2019) and Strategy QA for true or false questions (Geva et al., 2021). For symbolic reasoning, we consider Letter, a dataset for last letter concatenation by Wei et al. (2022). More details about these datasets are shown in App. A.1.
Dataset Splits Yes For each model on each dataset, we poison a specific proportion of demonstrations, which is detailed in Tab. 4 in App. A.3. Again, these choices can be easily determined in practice using merely twenty clean instances, as demonstrated by our ablation studies in Sec. 4.4.
Hardware Specification No The paper mentions the use of LLMs like GPT-3.5, GPT-4, PaLM2, and Llama2, along with some inference settings (e.g., temperature, top p, float16 data type). However, it does not specify the underlying hardware (e.g., specific GPU or CPU models, memory details) used to run these models or conduct the experiments.
Software Dependencies No The paper mentions LLMs such as GPT-3.5, GPT-4, PaLM2, and Llama2, and refers to Open AI documentation. It specifies a "float16 data type" for Llama2 inference. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used (e.g., Python 3.x).
Experiment Setup Yes We follow the decoding strategy as on the documentation from Open AI (2023b), including temperature to 1 and top p to 1. The decoding strategy is set to temperature = 0.7, top p = 0.95, top k = 40 by default [for PaLM2]. The decoding strategy is set to temperature = 1, top p = 0.7, top k = 50 [for Llama2].