Attention as Implicit Structural Inference

Authors: Ryan Singh, Christopher L. Buckley

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental 'Here we investigate two [such mechanisms] and demonstrate their behaviour on explanatory toy problems: (a) extending the value function to incorporate more nodes of a graphical model, yielding a mechanism with a bias toward attending to multiple tokens; (b) introducing a geometric prior (with conjugate hyper-prior), producing a mechanism which dynamically scales the context window depending on input.' Figure 2 (Multihop Attention) describes the toy problem: x2 is generated causally from x1 and x0, which are used to generate y. Multihop Attention, which takes two steps on the attention graph, has the correct bias to learn the task and approaches the performance of two-layer Self Attention, while a single layer of Self Attention cannot; empirically, Multihop Attention balances attention across two positions, whereas Self Attention favours a single position. (A hedged sketch of the multihop mechanism is given after this table.)
Researcher Affiliation Collaboration Ryan Singh: School of Engineering and Informatics, University of Sussex (rs773@sussex.ac.uk). Christopher L. Buckley: School of Engineering and Informatics, University of Sussex; VERSES AI Research Lab, Los Angeles, CA, USA.
Pseudocode Yes Algorithm 1 Attention, Algorithm 2 Multihop, Algorithm 3 Expanding
Open Source Code No The paper does not include an unambiguous statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository.
Open Datasets No The paper describes 'Task Setup' where 'We simulate a simple dataset' and 'Input and target sequence are generated similarly to above'. There is no mention of a publicly available dataset, nor are there any links, DOIs, repositories, or formal citations provided for accessing the simulated data.
Dataset Splits No The paper provides 'Training parameters' but does not specify exact training/validation/test dataset splits, percentages, or absolute sample counts. It describes custom data generation but not how the generated data is partitioned for training, validation, or testing.
Hardware Specification No The paper does not provide any specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies No The paper mentions 'torch.rand' for matrix initialization, implying the use of PyTorch, but does not provide a specific version number for PyTorch or any other software dependencies like the ADAM optimizer.
Experiment Setup Yes Training parameters (across all models): batch size: 200, number of batches: 10, optimiser: ADAM, learning rate: 1e-3, number of different random seeds: 10. For expanding attention the hyperparameters were set as α = .1, β = .9.
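
To make the reproducibility picture concrete, below is a minimal PyTorch sketch of the attention variants named above (PyTorch being the framework implied by the paper's use of torch.rand). It assumes that "two steps on the attention graph" means applying the softmax attention matrix twice to the values; the paper's Algorithm 1 (Attention) and Algorithm 2 (Multihop) may differ in their exact parameterisation, and Algorithm 3 (Expanding) is not sketched because its geometric prior is only described qualitatively here.

    import torch
    import torch.nn.functional as F

    def attention(q, k, v):
        # Standard scaled dot-product attention (cf. Algorithm 1, "Attention").
        d = q.shape[-1]
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # attention matrix A
        return a @ v

    def multihop_attention(q, k, v):
        # Hedged reading of Algorithm 2, "Multihop": the attention matrix A is
        # applied twice (two steps on the attention graph), so the output can
        # draw on tokens two hops away through an intermediate token.
        d = q.shape[-1]
        a = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        return a @ (a @ v)

    # Toy usage: sequences of length 3 (x0, x1, x2) with 16-dimensional embeddings.
    x = torch.rand(200, 3, 16)   # batch of 200, matching the reported batch size
    out = multihop_attention(x, x, x)
    print(out.shape)             # torch.Size([200, 3, 16])

In this reading a single layer can propagate information two hops with one set of attention weights, which is consistent with the reported observation that Multihop Attention approaches the performance of two-layer Self Attention on the toy task.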
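The reported training parameters can likewise be collected into a hedged driver script. Only the optimiser, learning rate, batch size, batch count, seed count, and expanding-attention hyperparameters are taken from the paper; make_model, simulate_batch, and the MSE loss are hypothetical placeholders, since no code or data generator is released.

    import torch

    # Values taken from the reported setup; everything else is assumed.
    BATCH_SIZE    = 200
    NUM_BATCHES   = 10
    LEARNING_RATE = 1e-3
    NUM_SEEDS     = 10
    ALPHA, BETA   = 0.1, 0.9   # expanding-attention hyper-prior parameters

    def train_one_seed(seed, make_model, simulate_batch):
        # make_model and simulate_batch are hypothetical stand-ins for the
        # paper's (unreleased) model definition and toy-data generator.
        torch.manual_seed(seed)
        model = make_model()
        optimiser = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
        for _ in range(NUM_BATCHES):
            x, y = simulate_batch(BATCH_SIZE)
            loss = torch.nn.functional.mse_loss(model(x), y)  # assumed loss
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
        return loss.item()

    # Repeat over the reported 10 random seeds:
    # losses = [train_one_seed(s, make_model, simulate_batch) for s in range(NUM_SEEDS)]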