Compositional Attention: Disentangling Search and Retrieval
Authors: Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings. |
| Researcher Affiliation | Academia | Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie (Mila, Université de Montréal) |
| Pseudocode | No | The paper describes the mechanism using mathematical equations and computation graphs (Figure 2) but does not include a labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | Yes | Open-sourced implementation is available at https://github.com/sarthmit/Compositional-Attention |
| Open Datasets | Yes | Sort-of-CLEVR (Santoro et al., 2017) is a Visual Question-Answering (VQA) task... We perform experiments on the Wiki Text-103 data corpus (Merity et al., 2016)... We pose the problem of image classification across four different datasets CIFAR10, Fashion MNIST, SVHN and Equilateral Triangle Detection as a multi-task learning setup. |
| Dataset Splits | Yes | The corpus consists of 28,475 articles in its training split and 60 in the validation and test splits, respectively |
| Hardware Specification | No | The paper mentions running experiments on 'GPUs' and discusses FLOPs, but does not provide specific details on the hardware used, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using 'fairseq codebase' and 'pytorch-Op Counter' but does not specify version numbers for these or other software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We use a 4-layered transformer with shared parameters and ablate with transformer dimensions of 32, 256, and 512 and FFN dimensions of 64, 512, and 1024 respectively. We consider baselines with 4 and 8 heads, and for the proposed model we use 4 searches and ablate over 1 to 4 retrievals. We use 32 dimensions for the retrieval query and key. We train the model with a learning rate of 0.0001 for 100 epochs. |
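
As a companion to the Experiment Setup row above, below is a minimal PyTorch sketch of the search/retrieval factorization the paper describes, instantiated with the quoted hyperparameters (4 searches, a small number of retrievals, 32-dimensional retrieval query/key). The projection layout, the source of the retrieval query and key, and all module and argument names are assumptions made for illustration; the authors' actual implementation is in the linked repository (https://github.com/sarthmit/Compositional-Attention).

```python
# Hedged sketch of the search/retrieval factorization; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompositionalAttention(nn.Module):
    """Sketch: S searches produce attention matrices, R retrievals produce values,
    and a soft selection decides which retrieval each search composes with."""

    def __init__(self, dim, n_searches=4, n_retrievals=4, qk_dim=32):
        super().__init__()
        assert dim % n_searches == 0
        self.S, self.R, self.qk_dim = n_searches, n_retrievals, qk_dim
        self.head_dim = dim // n_searches
        # Search: per-search query/key projections, as in standard multi-head attention.
        self.q_proj = nn.Linear(dim, n_searches * self.head_dim)
        self.k_proj = nn.Linear(dim, n_searches * self.head_dim)
        # Retrieval: value projections that are not tied to any particular search.
        self.v_proj = nn.Linear(dim, n_retrievals * self.head_dim)
        # Retrieval query (from the input) and key (from each candidate retrieval)
        # used to softly pair searches with retrievals; exact sourcing is an assumption.
        self.rq_proj = nn.Linear(dim, n_searches * qk_dim)
        self.rk_proj = nn.Linear(self.head_dim, qk_dim)
        self.out_proj = nn.Linear(n_searches * self.head_dim, dim)

    def forward(self, x):
        B, T, _ = x.shape
        d, S, R = self.head_dim, self.S, self.R
        q = self.q_proj(x).view(B, T, S, d).transpose(1, 2)            # (B, S, T, d)
        k = self.k_proj(x).view(B, T, S, d).transpose(1, 2)            # (B, S, T, d)
        v = self.v_proj(x).view(B, T, R, d).transpose(1, 2)            # (B, R, T, d)

        # Search: one attention matrix per search head.
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, S, T, T)

        # Candidate retrievals: apply every search matrix to every set of values.
        cand = torch.einsum('bstu,brud->bsrtd', attn, v)               # (B, S, R, T, d)

        # Soft selection: a retrieval query per search scores each candidate retrieval.
        rq = self.rq_proj(x).view(B, T, S, self.qk_dim).transpose(1, 2)  # (B, S, T, qk)
        rk = self.rk_proj(cand)                                        # (B, S, R, T, qk)
        sel = F.softmax(torch.einsum('bstq,bsrtq->bsrt', rq, rk)
                        / self.qk_dim ** 0.5, dim=2)                   # (B, S, R, T)
        out = (sel.unsqueeze(-1) * cand).sum(dim=2)                    # (B, S, T, d)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, -1))
```

A configuration matching the middle ablation quoted above would be, for example, `CompositionalAttention(dim=256, n_searches=4, n_retrievals=4, qk_dim=32)`; the released code may make different choices for value dimensions, normalization, and where the retrieval query and key are computed from.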