Transformers need glasses! Information over-squashing in language tasks
Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence supporting our claims on contemporary LLMs. |
| Researcher Affiliation | Collaboration | Federico Barbero (University of Oxford, federico.barbero@cs.ox.ac.uk); Andrea Banino (Google DeepMind, abanino@google.com) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We commit to releasing the code we have used to generate the prompts in the near future. |
| Open Datasets | No | The paper uses pre-trained models (Gemini 1.5, Gemma 7B); for the synthetic experiments, the data is generated on the fly: 'We sample key, query, and values from a Gaussian distribution...'. No specific public dataset is provided for access or download. |
| Dataset Splits | No | We do not perform any training in our work and use pre-trained models. |
| Hardware Specification | No | We run a local version of Gemma 7B on modest hardware to analyse the internal representations. |
| Software Dependencies | No | The paper mentions 'bf16' (bfloat16) precision and refers to LLMs like Gemini 1.5 and Gemma 7B, but it does not specify software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For the sum experiment we prompt as: "Please perform the following sum: seq. Please give the answer on the final line exactly as 'The final answer to your maths question is: xxxx', where xxxx is your answer." ... We set d = 64 and otherwise follow the exact structure of the decoder-only Transformer presented in the original Transformer paper. We experiment with a single attention layer... We consider a Transformer with a hidden dimension of 64, a single attention head, and we apply normalisations to simulate layer norm. |
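
The prompt quoted in the Experiment Setup row can be reconstructed mechanically. The sketch below is a minimal, hypothetical reconstruction of that prompt template; the operand sampling, separator format, and sequence length are assumptions, not the authors' released prompt-generation code.

```python
import random

def build_sum_prompt(seq: list[int]) -> str:
    """Build the sum-task prompt quoted in the Experiment Setup row.

    The exact rendering of `seq` (separator, operand range) is an
    assumption; the paper's released prompts may differ.
    """
    seq_str = " + ".join(str(x) for x in seq)
    return (
        f"Please perform the following sum: {seq_str}. "
        "Please give the answer on the final line exactly as "
        "'The final answer to your maths question is: xxxx', "
        "where xxxx is your answer."
    )

# Example: a sequence of single-digit operands (assumed operand range).
random.seed(0)
seq = [random.randint(0, 9) for _ in range(32)]
print(build_sum_prompt(seq))
print("Expected answer:", sum(seq))  # ground truth for checking the model's final line
```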
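For the synthetic experiments referenced in the Open Datasets and Experiment Setup rows, the following is a minimal sketch of the stated setup: keys, queries, and values sampled from a Gaussian, a single causal attention head with hidden dimension d = 64, and an explicit normalisation step standing in for layer norm. The sequence length, the normalisation details, and the final over-squashing probe are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

d = 64          # hidden dimension, as in the Experiment Setup row
seq_len = 128   # sequence length for the synthetic probe (assumed value)
rng = np.random.default_rng(0)

def simulate_layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalise each token representation (stand-in for layer norm, no learned scale/bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def single_head_causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """One causal attention head over Gaussian-sampled queries, keys, and values."""
    scores = q @ k.T / np.sqrt(d)                        # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)             # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v

# Sample keys, queries, and values from a Gaussian, as quoted in the Open Datasets row.
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

out = simulate_layer_norm(single_head_causal_attention(q, k, v))

# Over-squashing probe (assumed metric): perturb only the first token's value and
# measure how much of that change survives in the last token's representation.
v_perturbed = v.copy()
v_perturbed[0] += 1.0
out_perturbed = simulate_layer_norm(single_head_causal_attention(q, k, v_perturbed))
print("Last-token representation gap:", np.linalg.norm(out[-1] - out_perturbed[-1]))
```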