Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
MathAttack: Attacking Large Language Models towards Math Solving Ability
Authors: Zihao Zhou, Qiufeng Wang, Mingyu Jin, Jie Yao, Jianan Ye, Wei Liu, Wei Wang, Xiaowei Huang, Kaizhu Huang
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on our RobustMath and two another math benchmark datasets GSM8K and MultiArith show that MathAttack could effectively attack the math solving ability of LLMs. |
| Researcher Affiliation | Academia | 1 School of Advanced Technology, Xi'an Jiaotong-Liverpool University; 2 University of Liverpool; 3 Northwestern University; 4 ShanghaiTech University; 5 Duke Kunshan University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and dataset is available at: https://github.com/zhouzihao501/MathAttack. |
| Open Datasets | Yes | Two math word problems benchmark datasets GSM8K (Cobbe et al. 2021) and MultiArith (Roy and Roth 2015) are adopted in the experiments. [...] The code and dataset is available at: https://github.com/zhouzihao501/MathAttack. |
| Dataset Splits | No | The paper describes selecting subsets of the GSM8K and MultiArith datasets (307 and 150 MWP samples respectively) for experiments, but it does not specify the train/validation/test splits or their ratios used within these datasets for model training or evaluation. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions spaCy as the NER model but does not specify its version number. Other software mentioned refers to models or APIs without specific version details. |
| Experiment Setup | Yes | We set the temperature = 0 to stabilize the output of LLMs. When attacking victim models, we not only attack them with zero-shot prompt but also few-shot prompt. Specifically, we employ four MWP samples as shots and provide Chain-of-Thought (CoT) (Wei et al. 2022) annotations. |
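
The experiment setup quoted in the table (temperature = 0, zero-shot and four-shot Chain-of-Thought prompts) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the prompt template, the `Shot` class, the `build_prompt` helper, the `GENERATION_CONFIG` key name, and all example problems are assumptions introduced here and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence

# Decoding setting reported in the table: temperature = 0.
# The key name is an assumption; real parameter names depend on the model API.
GENERATION_CONFIG = {"temperature": 0}


@dataclass(frozen=True)
class Shot:
    """One worked math word problem (MWP) used as an in-context example."""
    question: str
    rationale: str  # Chain-of-Thought annotation
    answer: str


def build_prompt(question: str, shots: Sequence[Shot] = ()) -> str:
    """Return a zero-shot prompt if `shots` is empty, else a few-shot CoT prompt."""
    blocks = [
        f"Question: {s.question}\nAnswer: {s.rationale} The answer is {s.answer}."
        for s in shots
    ]
    # The target problem always comes last, with the answer left open.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Hypothetical placeholder shots, NOT drawn from the paper or its datasets.
EXAMPLE_SHOTS = [
    Shot("Ann has 2 apples and buys 3 more. How many apples does she have?",
         "Ann starts with 2 apples and adds 3, so 2 + 3 = 5.", "5"),
    Shot("A box holds 6 eggs. How many eggs are in 2 boxes?",
         "Each box holds 6 eggs, so 2 * 6 = 12.", "12"),
    Shot("Ben had 10 coins and spent 4. How many coins remain?",
         "Ben starts with 10 coins and spends 4, so 10 - 4 = 6.", "6"),
    Shot("Eight cookies are shared equally by 2 kids. How many does each get?",
         "Eight cookies split between 2 kids gives 8 / 2 = 4.", "4"),
]

TARGET = "Tom has 4 pens and loses 1. How many pens are left?"
zero_shot = build_prompt(TARGET)
few_shot = build_prompt(TARGET, EXAMPLE_SHOTS)
```

Either prompt string would then be sent to the victim model together with `GENERATION_CONFIG`; the actual request format depends on the model or API being evaluated.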