Transformers need glasses! Information over-squashing in language tasks

Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical evidence supporting our claims on contemporary LLMs.
Researcher Affiliation | Collaboration | Federico Barbero (University of Oxford, federico.barbero@cs.ox.ac.uk); Andrea Banino (Google DeepMind, abanino@google.com)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | We commit to releasing the code we have used to generate the prompts in the near future.
Open Datasets | No | The paper uses pre-trained models (Gemini 1.5, Gemma 7B) and, for synthetic experiments, generates data: 'We sample key, query, and values from a Gaussian distribution...'. No specific public dataset is mentioned for access or download.
Dataset Splits | No | We do not perform any training in our work and use pre-trained models.
Hardware Specification | No | We run a local version of Gemma 7B on modest hardware to analyse the internal representations.
Software Dependencies | No | The paper mentions 'bf16' (bfloat16) precision and refers to LLMs like Gemini 1.5 and Gemma 7B, but it does not specify software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For the sum experiment we prompt as: 'Please perform the following sum: seq. Please give the answer on the final line exactly as The final answer to your maths question is: xxxx, where xxxx is your answer.' ... We set d = 64 and otherwise follow the exact structure of the decoder-only Transformer presented in the original Transformer paper. We experiment with a single attention layer... We consider a Transformer with a hidden dimension of 64, a single attention head, and we apply normalisations to simulate layer norm.
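
The prompt format quoted in the Experiment Setup row can be reproduced with a small helper. The snippet below is an illustrative sketch, not the authors' released prompt-generation code (which the paper says will be published later); the function name build_sum_prompt and the example operands are assumptions, while the prompt wording follows the row above.

```python
# Hypothetical helper mirroring the sum-task prompt quoted in the table above.
# The wording follows the Experiment Setup row; the function name and the
# example operands are illustrative assumptions, not the authors' code.

def build_sum_prompt(numbers):
    """Format a list of integers into the sum-task prompt."""
    seq = " + ".join(str(n) for n in numbers)
    return (
        f"Please perform the following sum: {seq}. "
        "Please give the answer on the final line exactly as "
        "The final answer to your maths question is: xxxx, "
        "where xxxx is your answer."
    )

if __name__ == "__main__":
    print(build_sum_prompt([3, 1, 4, 1, 5]))
```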
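
The synthetic setup described in the Open Datasets and Experiment Setup rows (keys, queries, and values sampled from a Gaussian, a single attention head, hidden dimension d = 64, and normalisations that simulate layer norm) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation; in particular, the sequence length and the exact placement of the normalisations are not specified in this excerpt and are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64          # hidden dimension from the Experiment Setup row
seq_len = 128   # illustrative sequence length (assumption, not from the paper)

# Sample queries, keys, and values from a Gaussian, as the Open Datasets row describes.
Q = rng.standard_normal((seq_len, d))
K = rng.standard_normal((seq_len, d))
V = rng.standard_normal((seq_len, d))

def layer_norm(x, eps=1e-6):
    """Per-token normalisation used here to simulate layer norm."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_attention(Q, K, V):
    """Single-head, decoder-only (causally masked) attention layer."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # hide future positions
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Placement of the normalisations is an assumption made for this sketch.
out = layer_norm(causal_attention(layer_norm(Q), layer_norm(K), layer_norm(V)))
print(out.shape)  # (seq_len, d)
```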