Transformers need glasses! Information over-squashing in language tasks
Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical evidence supporting our claims on contemporary LLMs. |
| Researcher Affiliation | Collaboration | Federico Barbero (University of Oxford, federico.barbero@cs.ox.ac.uk); Andrea Banino (Google DeepMind, abanino@google.com) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We commit to releasing the code we have used to generate the prompts in the near future. |
| Open Datasets | No | The paper uses pre-trained models (Gemini 1.5, Gemma 7B); for the synthetic experiments, the data is generated on the fly: 'We sample key, query, and values from a Gaussian distribution...'. No specific public dataset is provided for access or download. |
| Dataset Splits | No | We do not perform any training in our work and use pre-trained models. |
| Hardware Specification | No | We run a local version of Gemma 7B on modest hardware to analyse the internal representations. |
| Software Dependencies | No | The paper mentions 'bf16' (bfloat16) precision and refers to LLMs like Gemini 1.5 and Gemma 7B, but it does not specify software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For the sum experiment we prompt as: "Please perform the following sum: seq. Please give the answer on the final line exactly as 'The final answer to your maths question is: xxxx', where xxxx is your answer." ... We set d = 64 and otherwise follow the exact structure of the decoder-only Transformer presented in the original Transformer paper. We experiment with a single attention layer... We consider a Transformer with a hidden dimension of 64, a single attention head, and we apply normalisations to simulate layer norm. |
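
The prompt quoted in the Experiment Setup row can be reconstructed mechanically. The sketch below is a minimal, hypothetical reconstruction of that prompt template; the operand sampling, separator format, and sequence length are assumptions, not the authors' released prompt-generation code.

```python
import random

def build_sum_prompt(seq: list[int]) -> str:
    """Build the sum-task prompt quoted in the Experiment Setup row.

    The exact rendering of `seq` (separator, operand range) is an
    assumption; the paper's released prompts may differ.
    """
    seq_str = " + ".join(str(x) for x in seq)
    return (
        f"Please perform the following sum: {seq_str}. "
        "Please give the answer on the final line exactly as "
        "'The final answer to your maths question is: xxxx', "
        "where xxxx is your answer."
    )

# Example: a sequence of single-digit operands (assumed operand range).
random.seed(0)
seq = [random.randint(0, 9) for _ in range(32)]
print(build_sum_prompt(seq))
print("Expected answer:", sum(seq))  # ground truth for checking the model's final line
```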
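For the synthetic experiments referenced in the Open Datasets and Experiment Setup rows, the following is a minimal sketch of the stated setup: keys, queries, and values sampled from a Gaussian, a single causal attention head with hidden dimension d = 64, and an explicit normalisation step standing in for layer norm. The sequence length, the normalisation details, and the final over-squashing probe are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

d = 64          # hidden dimension, as in the Experiment Setup row
seq_len = 128   # sequence length for the synthetic probe (assumed value)
rng = np.random.default_rng(0)

def simulate_layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalise each token representation (stand-in for layer norm, no learned scale/bias)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def single_head_causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """One causal attention head over Gaussian-sampled queries, keys, and values."""
    scores = q @ k.T / np.sqrt(d)                        # (seq_len, seq_len)
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), 1)
    scores = np.where(mask, -np.inf, scores)             # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ v

# Sample keys, queries, and values from a Gaussian, as quoted in the Open Datasets row.
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))
v = rng.standard_normal((seq_len, d))

out = simulate_layer_norm(single_head_causal_attention(q, k, v))

# Over-squashing probe (assumed metric): perturb only the first token's value and
# measure how much of that change survives in the last token's representation.
v_perturbed = v.copy()
v_perturbed[0] += 1.0
out_perturbed = simulate_layer_norm(single_head_causal_attention(q, k, v_perturbed))
print("Last-token representation gap:", np.linalg.norm(out[-1] - out_perturbed[-1]))
```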