Towards Understanding Inductive Bias in Transformers: A View From Infinity

Authors: Itay Lavie, Guy Gur-Ari, Zohar Ringel

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experimentally that the learnability bounds derived from the dimension of the relevant irreducible representations are tight. We analyze WikiText-2 and show evidence for an approximate permutation symmetry in its principal components, suggesting that the toolbox presented can be of use in natural language datasets.
Researcher Affiliation | Collaboration | (1) Racah Institute of Physics, Hebrew University of Jerusalem, Jerusalem 91904, Israel; (2) Augment Computing.
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about open-sourcing its code or links to a code repository.
Open Datasets | Yes | We use a mixture of hidden Markov models (HMMs) (Baum & Petrie, 1966) as a dataset. Finally, we argue that the WikiText dataset does indeed possess a degree of permutation symmetry. We analyze WikiText-2 and show evidence for an approximate permutation symmetry in its principal components, suggesting that the toolbox presented can be of use in natural language datasets.
Dataset Splits | No | The paper mentions training on a mixture of HMMs and testing on different distributions (train and test distributions for the MSE loss), but it does not specify a separate validation split or a cross-validation methodology.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., CPU/GPU models, memory, or cluster specifications).
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks).
Experiment Setup | Yes | The NN is trained on 8,000 samples drawn from the mixture p, q ~ U(0, 0.4) with SGD, a mini-batch size of 50, and a learning rate of 10^-3 for 10,000 epochs. The weights are initialized according to LeCun initialization, meaning the weights in each layer are i.i.d. with w ~ N(0, 1/fan_in), and the biases are initialized to zero. (A hedged sketch of this training recipe follows the table.)
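
As a rough illustration of the quoted experiment setup, here is a minimal PyTorch sketch. It is not the authors' code: only the hyperparameters quoted above (8,000 samples, p, q ~ U(0, 0.4), SGD, mini-batch size 50, learning rate 10^-3, 10,000 epochs, LeCun initialization with zero biases) come from the paper excerpt. The two-state HMM emission scheme, the sequence length, the regression target, and the MLP architecture are assumptions introduced only to make the example self-contained.

```python
# Minimal sketch (not the authors' code) of the quoted training recipe.
import torch
import torch.nn as nn

def sample_hmm_sequence(p, q, length=16):
    """ASSUMPTION: a 2-state HMM with switching probabilities p (state 0 -> 1)
    and q (state 1 -> 0); the emission is simply the hidden state."""
    states = torch.empty(length, dtype=torch.long)
    s = torch.randint(0, 2, (1,)).item()
    for t in range(length):
        states[t] = s
        flip = p if s == 0 else q
        if torch.rand(1).item() < flip:
            s = 1 - s
    return states.float()

# Dataset: 8,000 sequences, each from an HMM whose parameters are drawn as
# p, q ~ U(0, 0.4), i.e. a mixture of HMMs (sequence length is an assumption).
n_samples, seq_len = 8000, 16
X = torch.stack([
    sample_hmm_sequence(*(0.4 * torch.rand(2)).tolist(), length=seq_len)
    for _ in range(n_samples)
])
y = X.mean(dim=1, keepdim=True)  # placeholder regression target (assumption)

# ASSUMPTION: a small MLP stands in for the network; the quoted setup does not
# specify the architecture.
model = nn.Sequential(nn.Linear(seq_len, 128), nn.ReLU(), nn.Linear(128, 1))

# LeCun initialization, as quoted: w ~ N(0, 1/fan_in), biases set to zero.
for m in model.modules():
    if isinstance(m, nn.Linear):
        fan_in = m.weight.shape[1]
        nn.init.normal_(m.weight, mean=0.0, std=fan_in ** -0.5)
        nn.init.zeros_(m.bias)

opt = torch.optim.SGD(model.parameters(), lr=1e-3)  # learning rate 10^-3
loss_fn = nn.MSELoss()

for epoch in range(10_000):                         # 10,000 epochs, as quoted
    perm = torch.randperm(n_samples)
    for i in range(0, n_samples, 50):               # mini-batch size 50
        idx = perm[i:i + 50]
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()
```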