Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Transformers need glasses! Information over-squashing in language tasks
Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence supporting our claims on contemporary LLMs. |
| Researcher Affiliation | Collaboration | Federico Barbero (University of Oxford, EMAIL), Andrea Banino (Google DeepMind, EMAIL) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We commit to releasing the code we have used to generate the prompts in the near future. |
| Open Datasets | No | The paper uses pre-trained models (Gemini 1.5, Gemma 7B) and for synthetic experiments, generates data: 'We sample key, query, and values from a Gaussian distribution...'. No specific public dataset is mentioned for access or download. |
| Dataset Splits | No | We do not perform any training in our work and use pre-trained models. |
| Hardware Specification | No | We run a local version of Gemma 7B on modest hardware to analyse the internal representations. |
| Software Dependencies | No | The paper mentions 'bf16' (bfloat16) precision and refers to LLMs like Gemini 1.5 and Gemma 7B, but it does not specify software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For the sum experiment we prompt as: 'Please perform the following sum: seq. Please give the answer on the final line exactly as "The final answer to your maths question is: xxxx", where xxxx is your answer.' ... We set d = 64 and otherwise follow the exact structure of the decoder-only Transformer presented in the original Transformer paper. We experiment with a single attention layer... We consider a Transformer with a hidden dimension of 64, a single attention head, and we apply normalisations to simulate layer norm. |
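The synthetic setup described in the last two rows (Gaussian-sampled keys, queries, and values; a single attention head; hidden dimension 64; normalisation to simulate layer norm) can be sketched as follows. This is a minimal illustration, not the authors' code: the sequence length, random seed, and the placement of the normalisation are assumptions; only d = 64 and the single-head, single-layer structure come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is an arbitrary choice

def layer_norm(x, eps=1e-5):
    # Normalise each row to zero mean and unit variance,
    # simulating layer norm as described in the paper.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(q, k, v):
    # Scaled dot-product attention for a single head,
    # following the original Transformer formulation.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

# d = 64 is from the paper; the sequence length n is illustrative.
n, d = 16, 64
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))

out = layer_norm(single_head_attention(q, k, v))
print(out.shape)  # one attention layer's normalised output
```

A study of information over-squashing would then inspect how `out` varies (or fails to vary) as individual input tokens are perturbed, but that analysis is beyond this sketch.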