Attention as Implicit Structural Inference
Authors: Ryan Singh, Christopher L Buckley
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper investigates two extensions of standard attention and demonstrates their behaviour on explanatory toy problems: (a) extending the value function to incorporate more nodes of a graphical model, yielding a mechanism with a bias toward attending to multiple tokens; (b) introducing a geometric prior (with conjugate hyper-prior), producing a mechanism which dynamically scales the context window depending on the input. Figure 2 (Multihop Attention) describes the toy problem (x2 is generated causally from x1 and x0, which are used to generate y) and compares Multihop Attention, which takes two steps on the attention graph, with Self Attention: Multihop Attention has the correct bias to learn the task, approaching the performance of two-layer Self Attention, while a single layer of Self Attention cannot; examining the attention weights empirically, Multihop Attention balances attention across two positions, while Self Attention favours a single position. Section 9 gives the experimental details. (Hedged sketches of these mechanisms appear below the table.) |
| Researcher Affiliation | Collaboration | Ryan Singh, School of Engineering and Informatics, University of Sussex (rs773@sussex.ac.uk); Christopher L. Buckley, School of Engineering and Informatics, University of Sussex, and VERSES AI Research Lab, Los Angeles, CA, USA. |
| Pseudocode | Yes | Algorithm 1 Attention, Algorithm 2 Multihop, Algorithm 3 Expanding |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing code for the methodology described, nor does it provide a direct link to a source-code repository. |
| Open Datasets | No | The paper describes 'Task Setup' where 'We simulate a simple dataset' and 'Input and target sequence are generated similarly to above'. There is no mention of a publicly available dataset, nor are there any links, DOIs, repositories, or formal citations provided for accessing the simulated data. |
| Dataset Splits | No | The paper provides 'Training parameters' but does not specify exact training/validation/test dataset splits, percentages, or absolute sample counts. It describes custom data generation but not how it's partitioned for training, validation, or testing. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions 'torch.rand' for matrix initialization, implying the use of PyTorch, but does not provide a specific version number for PyTorch or any other software dependencies like the ADAM optimizer. |
| Experiment Setup | Yes | Training parameters (across all models): batch size: 200, number of batches: 10, optimiser: ADAM, learning rate: 1e-3, number of different random seeds: 10. For Expanding Attention the hyperparameters were set as α = 0.1, β = 0.9. (See the training-setup sketch below the table.) |
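
The paper's Algorithms 1 (Attention) and 2 (Multihop) are given as pseudocode only. Below is a minimal, hedged sketch of single-head scaled dot-product attention and a two-hop variant that takes two steps on the attention graph; the shapes, function names, and the composition of the attention matrix with itself are illustrative assumptions, not the authors' exact algorithms.

```python
# Minimal sketch: standard attention and a two-hop ("Multihop") variant.
# Shapes and the A @ A composition are assumptions, not the paper's exact pseudocode.
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Standard scaled dot-product attention. q, k: (n, d); v: (n, d_v)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # (n, n) similarity scores
    a = F.softmax(logits, dim=-1)                 # attention weights
    return a @ v, a

def multihop_attention(q, k, v, hops=2):
    """Take `hops` steps on the attention graph before reading out values,
    biasing the mechanism toward spreading weight over multiple positions."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    a = F.softmax(logits, dim=-1)
    a_multi = torch.linalg.matrix_power(a, hops)  # compose the attention graph
    return a_multi @ v, a_multi

# Usage on random inputs (the paper mentions torch.rand for initialisation)
n, d = 8, 16
q, k, v = torch.rand(n, d), torch.rand(n, d), torch.rand(n, d)
out_single, _ = attention(q, k, v)
out_multi, _ = multihop_attention(q, k, v, hops=2)
```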
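For Algorithm 3 (Expanding), the paper introduces a geometric prior over positions with a conjugate (Beta) hyper-prior so the effective context window is inferred per input. The sketch below shows only the simpler fixed-decay case, adding the log of a geometric prior over relative offsets to the attention logits; the Beta(α, β) hyper-prior and the dynamic window scaling are not implemented here, and the decay value and masking convention are assumptions.

```python
# Sketch: causal attention with a fixed geometric prior over how far back
# each query attends. The paper's Expanding mechanism additionally infers
# the decay via a conjugate hyper-prior; that part is omitted here.
import math
import torch
import torch.nn.functional as F

def geometric_prior_attention(q, k, v, decay=0.9):
    """Attention logits plus log p(offset) for p(offset) = (1 - decay) * decay**offset."""
    n, d = q.shape
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    pos = torch.arange(n)
    offset = (pos.unsqueeze(1) - pos.unsqueeze(0)).float()        # offset[i, j] = i - j
    log_prior = offset * math.log(decay) + math.log(1.0 - decay)  # geometric log-prior
    log_prior = log_prior.masked_fill(offset < 0, float("-inf"))  # only attend backwards
    a = F.softmax(logits + log_prior, dim=-1)
    return a @ v, a

# Usage
n, d = 8, 16
q, k, v = torch.rand(n, d), torch.rand(n, d), torch.rand(n, d)
out, weights = geometric_prior_attention(q, k, v, decay=0.9)
```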
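The reported training parameters can be assembled into a setup sketch as follows. Only the numbers (batch size 200, 10 batches, ADAM, learning rate 1e-3, 10 seeds, α = 0.1, β = 0.9) come from the paper; the model constructor, the simulated-data generator, and the loss are hypothetical placeholders, since the paper's toy data are not publicly released.

```python
# Hedged reconstruction of the reported training setup; model, data generator,
# and loss are placeholders, only the hyperparameter values come from the paper.
import torch

BATCH_SIZE = 200
NUM_BATCHES = 10
LEARNING_RATE = 1e-3
NUM_SEEDS = 10
ALPHA, BETA = 0.1, 0.9  # Expanding Attention hyper-prior parameters

def train_one_seed(make_model, make_batch, seed):
    torch.manual_seed(seed)
    model = make_model()                                   # e.g. a single attention layer
    optimiser = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    for _ in range(NUM_BATCHES):
        x, y = make_batch(BATCH_SIZE)                      # simulated toy data (placeholder)
        loss = torch.nn.functional.mse_loss(model(x), y)   # placeholder loss
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
    return model

# Usage: results averaged over the 10 random seeds
# models = [train_one_seed(make_model, make_batch, s) for s in range(NUM_SEEDS)]
```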