Attention as Implicit Structural Inference

Authors: Ryan Singh, Christopher L. Buckley

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 'Here we investigate two [such mechanisms] and demonstrate their behaviour on explanatory toy problems: (a) extending the value function to incorporate more nodes of a graphical model, yielding a mechanism with a bias toward attending to multiple tokens; (b) introducing a geometric prior (with conjugate hyper-prior), producing a mechanism which dynamically scales the context window depending on input.' Figure 2 (Multihop Attention) describes the toy problem: x2 is generated causally from x1 and x0, which are used to generate y. Multihop Attention, which takes two steps on the attention graph, has the correct bias to learn the task and approaches the performance of two-layer Self Attention, while a single layer of Self Attention cannot; empirically, Multihop Attention balances attention across two positions, whereas Self Attention favours a single position. (A hedged sketch of the multihop mechanism is given after this table.)
Researcher Affiliation Collaboration Ryan Singh: School of Engineering and Informatics, University of Sussex (rs773@sussex.ac.uk). Christopher L. Buckley: School of Engineering and Informatics, University of Sussex; VERSES AI Research Lab, Los Angeles, CA, USA.
Pseudocode Yes Algorithm 1 Attention, Algorithm 2 Multihop, Algorithm 3 Expanding
Open Source Code No The paper does not include an unambiguous statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository.
Open Datasets No The paper describes 'Task Setup' where 'We simulate a simple dataset' and 'Input and target sequence are generated similarly to above'. There is no mention of a publicly available dataset, nor are there any links, DOIs, repositories, or formal citations provided for accessing the simulated data.
Dataset Splits No The paper provides 'Training parameters' but does not specify exact training/validation/test dataset splits, percentages, or absolute sample counts. It describes custom data generation but not how the generated data is partitioned for training, validation, or testing.
Hardware Specification No The paper does not provide any specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies No The paper mentions 'torch.rand' for matrix initialization, implying the use of PyTorch, but does not provide a specific version number for PyTorch or any other software dependencies like the ADAM optimizer.
Experiment Setup Yes Training parameters (across all models): batch size: 200, number of batches: 10, optimiser: ADAM, learning rate: 1e-3, number of different random seeds: 10. For expanding attention the hyperparameters were set as α = .1, β = .9.
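
To make the reproducibility picture concrete, below is a minimal PyTorch sketch of the attention variants named above (PyTorch being the framework implied by the paper's use of torch.rand). It assumes that "two steps on the attention graph" means applying the softmax attention matrix twice to the values; the paper's Algorithm 1 (Attention) and Algorithm 2 (Multihop) may differ in their exact parameterisation, and Algorithm 3 (Expanding) is not sketched because its geometric prior is only described qualitatively here.

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # Standard scaled dot-product attention (cf. Algorithm 1, "Attention").
        d = q.shape[-1]
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # attention matrix A
        return a @ v

    def multihop_attention(q, k, v):
        # Hedged reading of Algorithm 2, "Multihop": the attention matrix A is
        # applied twice (two steps on the attention graph), so the output can
        # draw on tokens two hops away through an intermediate token.
        d = q.shape[-1]
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return a @ (a @ v)

    # Toy usage: sequences of length 3 (x0, x1, x2) with 16-dimensional embeddings.
    x = torch.rand(200, 3, 16)   # batch of 200, matching the reported batch size
    out = multihop_attention(x, x, x)
    print(out.shape)             # torch.Size([200, 3, 16])

In this reading a single layer can propagate information two hops with one set of attention weights, which is consistent with the reported observation that Multihop Attention approaches the performance of two-layer Self Attention on the toy task.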
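The reported training parameters can likewise be collected into a hedged driver script. Only the optimiser, learning rate, batch size, batch count, seed count, and expanding-attention hyperparameters are taken from the paper; make_model, simulate_batch, and the MSE loss are hypothetical placeholders, since no code or data generator is released.

    import torch

    # Values taken from the reported setup; everything else is assumed.
    BATCH_SIZE    = 200
    NUM_BATCHES   = 10
    LEARNING_RATE = 1e-3
    NUM_SEEDS     = 10
    ALPHA, BETA   = 0.1, 0.9   # expanding-attention hyper-prior parameters

    def train_one_seed(seed, make_model, simulate_batch):
        # make_model and simulate_batch are hypothetical stand-ins for the
        # paper's (unreleased) model definition and toy-data generator.
        torch.manual_seed(seed)
        model = make_model()
        optimiser = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
        for _ in range(NUM_BATCHES):
            x, y = simulate_batch(BATCH_SIZE)
            loss = torch.nn.functional.mse_loss(model(x), y)  # assumed loss
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
        return loss.item()

    # Repeat over the reported 10 random seeds:
    # losses = [train_one_seed(s, make_model, simulate_batch) for s in range(NUM_SEEDS)]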