Towards Understanding Inductive Bias in Transformers: A View From Infinity
Authors: Itay Lavie, Guy Gur-Ari, Zohar Ringel
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experimentally that the learnability bounds found based on the dimension of the relevant irreducible representations are tight. We analyze WikiText-2 and show evidence for an approximate permutation symmetry in its principal components, suggesting that the toolbox presented can be of use in natural language datasets. |
| Researcher Affiliation | Collaboration | ¹Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem 91904, Israel; ²Augment Computing. |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing its code or links to a code repository. |
| Open Datasets | Yes | We use a mixture of hidden Markov models (HMMs) (Baum & Petrie, 1966) as a dataset. Finally, we argue that the WikiText dataset does indeed possess a degree of permutation symmetry. We analyze WikiText-2 and show evidence for an approximate permutation symmetry in its principal components, suggesting that the toolbox presented can be of use in natural language datasets. |
| Dataset Splits | No | The paper mentions training on a mixture of HMMs and testing on different distributions (train and test distributions for MSE loss), but it does not specify a separate validation dataset split or a cross-validation methodology. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory, or cluster specifications). |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks). |
| Experiment Setup | Yes | The NN is trained on 8,000 samples drawn from the mixture with p, q ~ U(0, 0.4), using SGD with a mini-batch size of 50 and a learning rate of 10⁻³ for 10,000 epochs. The weights are initialized according to LeCun initialization, meaning the weights in each layer are i.i.d. with w ~ N(0, 1/fan_in), and the biases are initialized to zero. (An illustrative code sketch of this setup follows the table.) |
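
To make the reported experiment setup concrete, here is a minimal PyTorch sketch under stated assumptions. Only the hyperparameters quoted above (8,000 samples, SGD with mini-batch size 50, learning rate 10⁻³, 10,000 epochs, LeCun initialization with zero biases, MSE loss) come from the paper; the placeholder model, input dimensionality, and synthetic data generator are hypothetical stand-ins for the paper's transformer and HMM-mixture dataset, which are not specified in this excerpt.

```python
# Hedged sketch of the reported training configuration. Hyperparameters are
# taken from the paper's description; the model and data below are placeholders.
import math
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def lecun_init_(module: nn.Module) -> None:
    """LeCun initialization: weights ~ N(0, 1/fan_in), biases set to zero."""
    for m in module.modules():
        if isinstance(m, nn.Linear):
            fan_in = m.in_features
            nn.init.normal_(m.weight, mean=0.0, std=math.sqrt(1.0 / fan_in))
            if m.bias is not None:
                nn.init.zeros_(m.bias)

# Hypothetical stand-in for the 8,000 samples drawn from the HMM mixture
# with p, q ~ U(0, 0.4); the true generator is described in the paper.
n_samples, input_dim = 8_000, 16
X = torch.randn(n_samples, input_dim)
y = torch.randn(n_samples, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=50, shuffle=True)

# Placeholder network; the paper's actual architecture is a transformer.
model = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, 1))
lecun_init_(model)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# SGD with mini-batch size 50, learning rate 1e-3, for 10,000 epochs.
for epoch in range(10_000):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```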