Case-Based or Rule-Based: How Do Transformers Do the Math?
Authors: Yi Hu, Xiaojuan Tang, Haotong Yang, Muhan Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through carefully designed intervention experiments on five math tasks, we confirm that transformers are performing case-based reasoning, no matter whether scratchpad is used, which aligns with the previous observations that transformers use subgraph matching/shortcut learning to reason. |
| Researcher Affiliation | Academia | ¹Institute for Artificial Intelligence, Peking University; ²National Key Laboratory of General Artificial Intelligence, BIGAI. |
| Pseudocode | Yes | `def sum_digit_by_digit(num1, num2): result=[] carry=0 while num1 or num2: digit1=num1.pop() if num1 else 0 digit2=num2.pop() if num2 else 0 total=digit1+digit2+carry result.insert(0,total%10) carry=total//10 if carry: result.insert(0,carry) return result` (flattened onto one line in the extracted table; a formatted, runnable version appears after this table). |
| Open Source Code | Yes | Code is available at https://github.com/GraphPKU/Case_or_Rule. |
| Open Datasets | Yes | We focus on binary operations, which take two numbers a, b as inputs. Denoting c as the target label, we construct datasets like D = {((a_i, b_i), c_i)} for five math tasks including addition, modular addition, base addition, linear regression, and chicken & rabbit problem: ... Code is available at https://github.com/GraphPKU/Case_or_Rule. (A minimal construction-and-split sketch appears after this table.) |
| Dataset Splits | Yes | Then, we artificially split the dataset by leaving out some continuous regions of examples as the test set with the remaining ones as the training set and re-train the model. ... we extract a square comprising 441 samples (from a total of approximately 10,000 samples) with a side length of 20 to form our test set, leaving the remainder as the training set. ... For comparison, we also fine-tune these models on datasets that are randomly split, where each training set comprises 70% of the total dataset. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions models like GPT-2, Llama-2-7B, and GPT-3.5-turbo, and refers to fine-tuning with the OpenAI API, but does not provide specific version numbers for software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | We list all the hyper-parameters used in the paper in Table 1. Case-based reasoning: GPT-2 (100 training epochs, batch size 30, learning rate 1e-4); Llama-2-7B (4 epochs, batch size 4, learning rate 2e-5). Rule-following fine-tuning: GPT-3.5 (4 epochs, batch size 4, OpenAI API default learning rate); Llama-2-7B (1 epoch, batch size 8, learning rate 2e-5). (A hypothetical configuration sketch appears after this table.) |
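The pseudocode quoted in the table above is flattened onto a single line; below is the same function laid out as runnable Python with brief comments added for clarity (the logic is unchanged).

```python
def sum_digit_by_digit(num1, num2):
    """Add two numbers given as digit lists (most significant digit first)."""
    result = []
    carry = 0
    while num1 or num2:
        # Consume one digit from the least-significant end of each number.
        digit1 = num1.pop() if num1 else 0
        digit2 = num2.pop() if num2 else 0
        total = digit1 + digit2 + carry
        result.insert(0, total % 10)  # digit for the current position
        carry = total // 10           # carry propagated to the next position
    if carry:
        result.insert(0, carry)
    return result


# Example: 57 + 68 = 125 (note the function consumes its input lists).
assert sum_digit_by_digit([5, 7], [6, 8]) == [1, 2, 5]
```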
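The dataset construction and the square hold-out split described in the Open Datasets and Dataset Splits rows can be illustrated with a minimal sketch. This is not the authors' code (which lives in the linked repository); the function names, the choice of the addition task, the value range, and the square's location are assumptions made for illustration only.

```python
import random


def build_addition_dataset(max_val=99):
    """D = {((a, b), a + b)} over [0, max_val]^2 (10,000 samples when max_val = 99)."""
    return [((a, b), a + b) for a in range(max_val + 1) for b in range(max_val + 1)]


def leave_square_out_split(dataset, center, side=20):
    """Hold out a square of (side + 1)^2 = 441 contiguous (a, b) pairs as the test set."""
    cx, cy = center
    half = side // 2

    def in_square(a, b):
        return abs(a - cx) <= half and abs(b - cy) <= half

    test = [ex for ex in dataset if in_square(*ex[0])]
    train = [ex for ex in dataset if not in_square(*ex[0])]
    return train, test


def random_split(dataset, train_frac=0.7, seed=0):
    """Comparison baseline: a random split with 70% of the data for training."""
    shuffled = dataset[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


data = build_addition_dataset()
train, test = leave_square_out_split(data, center=(50, 50))
assert len(test) == 441  # 21 x 21 square, matching the quote above
```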
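The hyper-parameters in the Experiment Setup row map naturally onto a standard Hugging Face fine-tuning configuration. The sketch below is only a guess at how such runs could be configured (model loading, tokenization, and the Trainer call are omitted); it is not the authors' training script, and the output directory names are placeholders.

```python
from transformers import TrainingArguments

# GPT-2, case-based reasoning: 100 epochs, batch size 30, learning rate 1e-4.
gpt2_case_based_args = TrainingArguments(
    output_dir="gpt2_case_based",        # placeholder path
    num_train_epochs=100,
    per_device_train_batch_size=30,
    learning_rate=1e-4,
)

# Llama-2-7B, rule-following fine-tuning: 1 epoch, batch size 8, learning rate 2e-5.
llama_rule_following_args = TrainingArguments(
    output_dir="llama2_rule_following",  # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
```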