Compositional Attention: Disentangling Search and Retrieval

Authors: Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Through a series of numerical experiments, we show that it outperforms standard multi-head attention on a variety of tasks, including some out-of-distribution settings." |
| Researcher Affiliation | Academia | Sarthak Mittal, Sharath Chandra Raparthy, Irina Rish, Yoshua Bengio, Guillaume Lajoie: Mila, Université de Montréal |
| Pseudocode | No | The paper describes the mechanism using mathematical equations and computation graphs (Figure 2) but does not include a labeled "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | "Open-sourced implementation is available at https://github.com/sarthmit/Compositional-Attention" |
| Open Datasets | Yes | "Sort-of-CLEVR (Santoro et al., 2017) is a Visual Question-Answering (VQA) task... We perform experiments on the WikiText-103 data corpus (Merity et al., 2016)... We pose the problem of image classification across four different datasets (CIFAR10, Fashion-MNIST, SVHN and Equilateral Triangle Detection) as a multi-task learning setup." |
| Dataset Splits | Yes | "The corpus consists of 28,475 articles in its training split and 60 in the validation and test splits respectively." |
| Hardware Specification | No | The paper mentions running experiments on "GPUs" and discusses FLOPs, but does not provide specific details on the hardware used, such as GPU models, CPU types, or memory configurations. |
| Software Dependencies | No | The paper mentions using the "fairseq" codebase and "pytorch-OpCounter" but does not specify version numbers for these or other software dependencies such as Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | "We use a 4-layered transformer with shared parameters and ablate with transformer dimensions 32, 256 and 512 and ffn dimension as 64, 512, 1024 respectively. We consider baseline with 4 and 8 heads and for the proposed model, we use 4 searches and ablate on 1-4 retrievals. We use 32 dimensions for the retrieval query and key dimensions. We train the model with 0.0001 learning rate for 100 epochs." |
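The experiment-setup row references the paper's core hyperparameters: a number of "searches" (heads that compute attention patterns), a smaller pool of "retrievals" (heads that compute values), and a 32-dimensional secondary query/key space that softly pairs each search with a retrieval. The sketch below is a minimal PyTorch rendering of that description, not the authors' released implementation; the projection layout and the placement of the secondary attention are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalAttention(nn.Module):
    """Sketch of compositional attention with S searches and R retrievals.

    Defaults loosely follow the paper's reported setup (4 searches,
    1-4 retrievals, 32-dim retrieval query/key). The exact projections
    here are assumptions, not the authors' code.
    """

    def __init__(self, dim, n_search=4, n_retrieve=2, retrieval_dim=32):
        super().__init__()
        assert dim % n_search == 0
        self.S, self.R = n_search, n_retrieve
        self.head_dim = dim // n_search
        # Search: one query/key projection per search head.
        self.q_proj = nn.Linear(dim, n_search * self.head_dim)
        self.k_proj = nn.Linear(dim, n_search * self.head_dim)
        # Retrieval: one value projection per retrieval head.
        self.v_proj = nn.Linear(dim, n_retrieve * self.head_dim)
        # Secondary attention that pairs each search with a retrieval.
        self.rq_proj = nn.Linear(dim, n_search * retrieval_dim)
        self.rk_proj = nn.Linear(self.head_dim, retrieval_dim)
        self.out_proj = nn.Linear(n_search * self.head_dim, dim)

    def forward(self, x):
        B, T, D = x.shape
        S, R, H = self.S, self.R, self.head_dim
        q = self.q_proj(x).view(B, T, S, H).transpose(1, 2)  # (B, S, T, H)
        k = self.k_proj(x).view(B, T, S, H).transpose(1, 2)  # (B, S, T, H)
        v = self.v_proj(x).view(B, T, R, H).transpose(1, 2)  # (B, R, T, H)
        # Search: one attention matrix per search head.
        attn = F.softmax(q @ k.transpose(-1, -2) / H ** 0.5, dim=-1)
        # Apply every search pattern to every retrieval's values.
        o = attn.unsqueeze(2) @ v.unsqueeze(1)               # (B, S, R, T, H)
        # Retrieval selection: soft attention over the R candidate outputs.
        rq = self.rq_proj(x).view(B, T, S, -1).transpose(1, 2)  # (B, S, T, d_r)
        rk = self.rk_proj(o)                                    # (B, S, R, T, d_r)
        score = (rq.unsqueeze(2) * rk).sum(-1) / rk.shape[-1] ** 0.5
        w = F.softmax(score, dim=2).unsqueeze(-1)               # (B, S, R, T, 1)
        out = (w * o).sum(2)                                    # (B, S, T, H)
        return self.out_proj(out.transpose(1, 2).reshape(B, T, S * H))

# Usage with the paper's smallest reported transformer dimension (32):
layer = CompositionalAttention(dim=32, n_search=4, n_retrieve=2)
y = layer(torch.randn(2, 10, 32))
print(y.shape)  # torch.Size([2, 10, 32])
```

Because searches and retrievals are decoupled, the number of value heads can be ablated independently of the number of attention patterns, which is what the "4 searches, 1-4 retrievals" sweep in the setup row varies.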