Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Transformers need glasses! Information over-squashing in language tasks

Authors: Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João Madeira Araújo, Oleksandr Vitvitskyi, Razvan Pascanu, Petar Veličković

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We provide empirical evidence supporting our claims on contemporary LLMs."
Researcher Affiliation | Collaboration | Federico Barbero (University of Oxford, EMAIL); Andrea Banino (Google DeepMind, EMAIL)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | "We commit to releasing the code we have used to generate the prompts in the near future."
Open Datasets | No | The paper uses pre-trained models (Gemini 1.5, Gemma 7B) and, for synthetic experiments, generates data: "We sample key, query, and values from a Gaussian distribution...". No specific public dataset is mentioned for access or download.
Dataset Splits | No | "We do not perform any training in our work and use pre-trained models."
Hardware Specification | No | "We run a local version of Gemma 7B on modest hardware to analyse the internal representations."
Software Dependencies | No | The paper mentions 'bf16' (bfloat16) precision and refers to LLMs such as Gemini 1.5 and Gemma 7B, but it does not specify software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | "For the sum experiment we prompt as: 'Please perform the following sum: seq. Please give the answer on the final line exactly as The final answer to your maths question is: xxxx, where xxxx is your answer.' ... We set d = 64 and otherwise follow the exact structure of the decoder-only Transformer presented in the original Transformer paper. We experiment with a single attention layer... We consider a Transformer with a hidden dimension of 64, a single attention head, and we apply normalisations to simulate layer norm."
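
The Experiment Setup and Open Datasets rows describe a small synthetic setting: queries, keys, and values are sampled from a Gaussian distribution, passed through a single attention layer with one head and hidden dimension d = 64, with normalisations applied to simulate layer norm. The snippet below is a minimal NumPy sketch of that setting, not the authors' released code; the sequence length, random seed, and the exact placement of the normalisation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64          # hidden dimension, as stated in the Experiment Setup row
seq_len = 32    # assumed sequence length (not specified in the excerpt)

def simulate_layer_norm(x, eps=1e-6):
    """Normalise each row to zero mean and unit variance (layer-norm style)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def single_head_attention(q, k, v):
    """Standard scaled dot-product attention with a single head."""
    scores = q @ k.T / np.sqrt(d)                       # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ v                                   # (seq_len, d)

# Sample queries, keys, and values from a standard Gaussian, as quoted
# from the paper, and apply the layer-norm-style normalisation.
q = simulate_layer_norm(rng.standard_normal((seq_len, d)))
k = simulate_layer_norm(rng.standard_normal((seq_len, d)))
v = rng.standard_normal((seq_len, d))

out = single_head_attention(q, k, v)
print(out.shape)  # (32, 64)
```

The max subtraction inside the softmax is only for numerical stability; details the excerpt does not specify (causal masking, positional encodings, where exactly the normalisation is applied) are left out or assumed here and may differ in the authors' actual setup.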