MathAttack: Attacking Large Language Models towards Math Solving Ability
Authors: Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, Kaizhu Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on our RobustMath and two other math benchmark datasets GSM8K and MultiArith show that MathAttack could effectively attack the math solving ability of LLMs. |
| Researcher Affiliation | Academia | ¹School of Advanced Technology, Xi'an Jiaotong-Liverpool University; ²University of Liverpool; ³Northwestern University; ⁴ShanghaiTech University; ⁵Duke Kunshan University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and dataset are available at: https://github.com/zhouzihao501/MathAttack. |
| Open Datasets | Yes | Two math word problem benchmark datasets, GSM8K (Cobbe et al. 2021) and MultiArith (Roy and Roth 2015), are adopted in the experiments. [...] The code and dataset are available at: https://github.com/zhouzihao501/MathAttack. |
| Dataset Splits | No | The paper describes selecting subsets of the GSM8K and MultiArith datasets (307 and 150 MWP samples, respectively) for experiments, but it does not specify the train/validation/test splits or their ratios used for model training or evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions spaCy as the NER model but does not specify its version number. Other software mentioned refers to models or APIs without specific version details. |
| Experiment Setup | Yes | We set the temperature = 0 to stabilize the output of LLMs. When attacking victim models, we attack them not only with a zero-shot prompt but also with a few-shot prompt. Specifically, we employ four MWP samples as shots and provide Chain-of-Thought (CoT) (Wei et al. 2022) annotations. (See the sketch after this table.) |
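
The few-shot CoT setup quoted in the Experiment Setup row can be sketched in a few lines. The sketch below is an assumption-heavy illustration, not the authors' released code: it assumes an OpenAI-style chat API, uses `gpt-3.5-turbo` as a stand-in victim model, and uses two placeholder exemplars where the paper uses four CoT-annotated MWP shots.

```python
# Hedged sketch of the evaluation setup described above (temperature = 0,
# few-shot CoT prompting of a victim LLM). Model name, exemplar problems,
# and prompt wording are illustrative assumptions, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder CoT exemplars (the paper uses four MWP shots; two shown here).
FEW_SHOT_EXEMPLARS = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have now?",
     "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5."),
    ("A class has 4 rows of 6 desks. How many desks are there in total?",
     "There are 4 rows with 6 desks each, so 4 * 6 = 24. The answer is 24."),
]


def build_prompt(question: str) -> str:
    """Concatenate the CoT exemplars followed by the (possibly attacked) MWP."""
    parts = [f"Q: {q}\nA: {a}" for q, a in FEW_SHOT_EXEMPLARS]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


def solve(question: str, model: str = "gpt-3.5-turbo") -> str:
    """Query the victim model with temperature = 0 to stabilize its output."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": build_prompt(question)}],
    )
    return response.choices[0].message.content
```

In a run of this kind, the same `solve()` call would be issued for both the original MWP and its adversarially rewritten variant, and the two extracted answers compared to judge whether the attack changed a previously correct solution.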