Case-Based or Rule-Based: How Do Transformers Do the Math?
Authors: Yi Hu, Xiaojuan Tang, Haotong Yang, Muhan Zhang
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through carefully designed intervention experiments on five math tasks, we confirm that transformers are performing case-based reasoning, no matter whether scratchpad is used, which aligns with the previous observations that transformers use subgraph matching/shortcut learning to reason. |
| Researcher Affiliation | Academia | ¹Institute for Artificial Intelligence, Peking University; ²National Key Laboratory of General Artificial Intelligence, BIGAI. |
| Pseudocode | Yes | `def sum_digit_by_digit(num1, num2): result=[] carry=0 while num1 or num2: digit1=num1.pop() if num1 else 0 digit2=num2.pop() if num2 else 0 total=digit1+digit2+carry result.insert(0,total%10) carry=total//10 if carry: result.insert(0,carry) return result` (flattened onto one line in the extracted table; a formatted, runnable version appears after this table). |
| Open Source Code | Yes | Code is available at https://github.com/GraphPKU/Case_or_Rule. |
| Open Datasets | Yes | We focus on binary operations, which take two numbers a, b as inputs. Denoting c as the target label, we construct datasets like D = {((a_i, b_i), c_i)} for five math tasks including addition, modular addition, base addition, linear regression, and chicken & rabbit problem: ... Code is available at https://github.com/GraphPKU/Case_or_Rule. (A minimal construction-and-split sketch appears after this table.) |
| Dataset Splits | Yes | Then, we artificially split the dataset by leaving out some continuous regions of examples as the test set with the remaining ones as the training set and re-train the model. ... we extract a square comprising 441 samples (from a total of approximately 10,000 samples) with a side length of 20 to form our test set, leaving the remainder as the training set. ... For comparison, we also fine-tune these models on datasets that are randomly split, where each training set comprises 70% of the total dataset. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running experiments are provided in the paper. |
| Software Dependencies | No | The paper mentions models like GPT-2, Llama-2-7B, and GPT-3.5-turbo, and refers to fine-tuning with the OpenAI API, but does not provide specific version numbers for software dependencies such as deep learning frameworks or libraries. |
| Experiment Setup | Yes | We list all the hyper-parameters used in the paper in Table 1. Case-based reasoning: GPT-2 (100 training epochs, batch size 30, learning rate 1e-4); Llama-2-7B (4 epochs, batch size 4, learning rate 2e-5). Rule-following fine-tuning: GPT-3.5 (4 epochs, batch size 4, OpenAI API default learning rate); Llama-2-7B (1 epoch, batch size 8, learning rate 2e-5). (A hypothetical configuration sketch appears after this table.) |
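The pseudocode quoted in the table above is flattened onto a single line; below is the same function laid out as runnable Python with brief comments added for clarity (the logic is unchanged).

```python
def sum_digit_by_digit(num1, num2):
    """Add two numbers given as digit lists (most significant digit first)."""
    result = []
    carry = 0
    while num1 or num2:
        # Consume one digit from the least-significant end of each number.
        digit1 = num1.pop() if num1 else 0
        digit2 = num2.pop() if num2 else 0
        total = digit1 + digit2 + carry
        result.insert(0, total % 10)  # digit for the current position
        carry = total // 10           # carry propagated to the next position
    if carry:
        result.insert(0, carry)
    return result


# Example: 57 + 68 = 125 (note the function consumes its input lists).
assert sum_digit_by_digit([5, 7], [6, 8]) == [1, 2, 5]
```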
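The dataset construction and the square hold-out split described in the Open Datasets and Dataset Splits rows can be illustrated with a minimal sketch. This is not the authors' code (which lives in the linked repository); the function names, the choice of the addition task, the value range, and the square's location are assumptions made for illustration only.

```python
import random


def build_addition_dataset(max_val=99):
    """D = {((a, b), a + b)} over [0, max_val]^2 (10,000 samples when max_val = 99)."""
    return [((a, b), a + b) for a in range(max_val + 1) for b in range(max_val + 1)]


def leave_square_out_split(dataset, center, side=20):
    """Hold out a square of (side + 1)^2 = 441 contiguous (a, b) pairs as the test set."""
    cx, cy = center
    half = side // 2

    def in_square(a, b):
        return abs(a - cx) <= half and abs(b - cy) <= half

    test = [ex for ex in dataset if in_square(*ex[0])]
    train = [ex for ex in dataset if not in_square(*ex[0])]
    return train, test


def random_split(dataset, train_frac=0.7, seed=0):
    """Comparison baseline: a random split with 70% of the data for training."""
    shuffled = dataset[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


data = build_addition_dataset()
train, test = leave_square_out_split(data, center=(50, 50))
assert len(test) == 441  # 21 x 21 square, matching the quote above
```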
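The hyper-parameters in the Experiment Setup row map naturally onto a standard Hugging Face fine-tuning configuration. The sketch below is only a guess at how such runs could be configured (model loading, tokenization, and the Trainer call are omitted); it is not the authors' training script, and the output directory names are placeholders.

```python
from transformers import TrainingArguments

# GPT-2, case-based reasoning: 100 epochs, batch size 30, learning rate 1e-4.
gpt2_case_based_args = TrainingArguments(
    output_dir="gpt2_case_based",        # placeholder path
    num_train_epochs=100,
    per_device_train_batch_size=30,
    learning_rate=1e-4,
)

# Llama-2-7B, rule-following fine-tuning: 1 epoch, batch size 8, learning rate 2e-5.
llama_rule_following_args = TrainingArguments(
    output_dir="llama2_rule_following",  # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)
```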