Learning the greatest common divisor: explaining transformer predictions

Authors: François Charton

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | I train 4-layer transformers to compute the greatest common divisor (GCD) of two positive integers, an important operation for rational arithmetic and number theory, and observe that: 1. Transformers learn to cluster input pairs with the same GCD. 2. Transformer predictions can be fully characterized. 3. Early during training, transformers learn to predict products of divisors of the base used to represent integers. 4. Models trained from log-uniform operands and outcomes achieve better performance: they correctly predict up to 91 GCD ≤ 100. 5. An unbalanced distribution of outcomes in the training set is required for full explainability: explainability partially fails once models are trained from uniformly distributed GCD.
Researcher Affiliation | Industry | François Charton, Meta AI, fcharton@meta.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code for these experiments can be found at https://github.com/facebookresearch/GCD.
Open Datasets | No | All input pairs are sampled uniformly between 1 and M = 10^6. All data is generated on the fly: different training epochs use different examples for the train and test set. The paper does not provide concrete access information (a specific link, DOI, repository name, or formal citation with authors and year) for a publicly available or open dataset, since the data is generated on the fly. (A minimal sketch of such on-the-fly generation is given after the table.)
Dataset Splits | No | All data is generated on the fly: different training epochs use different examples for the train and test set. After each epoch (300,000 examples), the models are evaluated on two test sets of 100,000 examples. The paper does not provide split information for a validation set.
Hardware Specification | Yes | All experiments are run on one NVIDIA V100 GPU with 32 GB of memory.
Software Dependencies | No | Transformers with 4 layers, 512 dimensions and 8 attention heads, using Adam (Kingma & Ba, 2014), are trained with a learning rate of 10^-5. The paper mentions tools such as the Adam optimizer but does not specify library names with version numbers needed for replication.
Experiment Setup | Yes | Transformers with 4 layers, 512 dimensions and 8 attention heads, using Adam (Kingma & Ba, 2014), are trained with a learning rate of 10^-5 (no scheduling is needed) on batches of 256 examples. All input pairs are sampled uniformly between 1 and M = 10^6. After each epoch (300,000 examples), the models are evaluated on two test sets of 100,000 examples. (An illustrative sketch of this configuration is given after the table.)
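
For the on-the-fly data generation described in the Open Datasets and Dataset Splits rows, the following is a minimal sketch, assuming base-10 digit tokenization and a simple separator token. These encoding choices are assumptions for illustration; the authors' actual implementation is the repository linked above (https://github.com/facebookresearch/GCD).

```python
# Minimal sketch (not the authors' code): sample operand pairs uniformly in
# [1, 10**6] on the fly and encode inputs and the target GCD as digit sequences.
import math
import random

M = 10**6   # upper bound on operands, as stated in the paper
BASE = 10   # representation base; base 10 is an assumption for this sketch

def to_digits(n, base):
    """Most-significant-first digits of n in the given base."""
    digits = []
    while n > 0:
        digits.append(n % base)
        n //= base
    return digits[::-1] or [0]

def sample_example():
    """One (input, target) pair: digit tokens of a and b, digit tokens of gcd(a, b).
    Using the value BASE as a separator token is a hypothetical choice."""
    a = random.randint(1, M)
    b = random.randint(1, M)
    x = to_digits(a, BASE) + [BASE] + to_digits(b, BASE)
    y = to_digits(math.gcd(a, b), BASE)
    return x, y

if __name__ == "__main__":
    # Fresh examples every call, mirroring "different training epochs use different examples".
    for _ in range(3):
        print(sample_example())
```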
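The Experiment Setup row fixes the main hyperparameters (4 layers, 512 dimensions, 8 attention heads, Adam at a learning rate of 10^-5, batches of 256). Below is a minimal PyTorch sketch of such a configuration; the vocabulary size, maximum sequence length, and the use of a standard encoder-decoder nn.Transformer are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch, assuming a standard PyTorch seq2seq transformer.
# Hyperparameters follow the paper; everything else is a placeholder.
import torch
import torch.nn as nn

VOCAB_SIZE = 16   # hypothetical: digit tokens plus a few special tokens
MAX_LEN = 32      # hypothetical maximum sequence length

class GCDTransformer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        self.pos = nn.Embedding(MAX_LEN, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=n_heads,
            num_encoder_layers=n_layers,
            num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, VOCAB_SIZE)

    def _embed(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.embed(tokens) + self.pos(positions)

    def forward(self, src, tgt):
        # Attention masks are omitted for brevity in this sketch.
        h = self.transformer(self._embed(src), self._embed(tgt))
        return self.out(h)

model = GCDTransformer()
# Adam with a fixed learning rate of 1e-5 and no scheduler, as reported in the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```

A training loop would then draw batches of 256 freshly generated examples (300,000 per epoch) and evaluate on held-out test sets of 100,000 examples, as described in the table.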