Coneheads: Hierarchy Aware Attention
Authors: Albert Tseng, Tao Yu, Toni J.B. Liu, Christopher De Sa
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test cone attention on a wide variety of models and tasks and show that it improves task-level performance over dot product attention and other baselines, and is able to match dot-product attention with significantly fewer parameters. Our results suggest that cone attention is an effective way to capture hierarchical relationships when calculating attention. Here, we present an empirical evaluation of cone attention in various attention networks. |
| Researcher Affiliation | Academia | Albert Tseng (Cornell University, albert@cs.cornell.edu); Tao Yu (Cornell University, tyu@cs.cornell.edu); Toni J.B. Liu (Cornell University, jl3499@cornell.edu); Christopher De Sa (Cornell University, cdesa@cs.cornell.edu) |
| Pseudocode | No | The paper describes mathematical formulations and concepts but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing their own source code for cone attention, nor does it provide a direct link to a repository. |
| Open Datasets | Yes | We test GATs on the transductive Cora and inductive multi-graph PPI datasets [17, 13]. We use the fairseq transformer_iwslt_de_en architecture to train a German to English translation model on the IWSLT 14 De-En dataset [20, 9]. We train DeiT-Ti models with 5 million parameters on the ImageNet-1K dataset for 300 epochs [28, 7]. We use the fairseq transformer_lm_wiki103 architecture (246.9M parameters) and train models on the WikiText-103 language modeling dataset with a block size of 512 tokens [20, 18]. (A hedged data-loading sketch for the graph datasets follows the table.) |
| Dataset Splits | No | The paper refers to datasets used for training and mentions epoch counts, but it does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or citations to predefined splits with such details). |
| Hardware Specification | No | The paper mentions that 'Compute resources were provided by the Cornell G2 Cluster' but does not specify the exact hardware components such as specific GPU or CPU models, or detailed specifications of the cluster used for the experiments. |
| Software Dependencies | No | The paper mentions 'our PyTorch cone attention implementations with torch.compile' but does not provide specific version numbers for PyTorch or any other software dependencies needed for replication. (A hedged torch.compile usage sketch follows the table.) |
| Experiment Setup | Yes | For each model we test, our experimental procedure consists of changing K in attention and training a new model from scratch. Unless otherwise noted in the appendix, we use the code and training scripts that the authors of each original model released. We assume released hyperparameters are tuned for dot product attention, as these models were state-of-the-art (SOTA) when new. We train DeiT-Ti models with 5 million parameters on the ImageNet-1K dataset for 300 epochs [28, 7]. We also train cone and dot product attention for 500 epochs, as we observed that training for more iterations improves performance. We use the fairseq transformer_lm_wiki103 architecture (246.9M parameters) and train models on the WikiText-103 language modeling dataset with a block size of 512 tokens [20, 18]. (A minimal kernel-swap sketch follows the table.) |
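
The Experiment Setup row states that the only change per model is swapping the similarity kernel K inside attention and retraining from scratch. The PyTorch sketch below shows, under that reading, what a pluggable-kernel attention could look like; `kernel_attention` and `dot_product_similarity` are illustrative names of our own, and the cone-based kernel itself is not reproduced here, only the hook where a hierarchy-aware K would plug in.

```python
# Hedged sketch: attention with a swappable similarity kernel K(q, k).
# The dot-product baseline below is standard; the paper's cone-based kernel
# is NOT reproduced here, it would simply be passed as `similarity_fn`.
import math
import torch
import torch.nn.functional as F


def dot_product_similarity(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product scores of shape (batch, heads, q_len, k_len)."""
    return q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))


def kernel_attention(q, k, v, similarity_fn=dot_product_similarity):
    """Attention where the score function K(q, k) is an argument.

    Replacing `similarity_fn` (for example with a cone-based kernel) is the
    only change; the softmax normalization and value aggregation stay fixed.
    """
    scores = similarity_fn(q, k)            # (B, H, Lq, Lk)
    weights = F.softmax(scores, dim=-1)     # each row sums to 1
    return weights @ v                      # (B, H, Lq, d_v)


if __name__ == "__main__":
    B, H, L, d = 2, 4, 8, 16
    q, k, v = (torch.randn(B, H, L, d) for _ in range(3))
    out = kernel_attention(q, k, v)         # dot-product baseline
    print(out.shape)                        # torch.Size([2, 4, 8, 16])
```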
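
The Software Dependencies row quotes the paper's mention of PyTorch cone attention implementations compiled with torch.compile, without pinned versions. Below is a minimal usage sketch, assuming PyTorch >= 2.0 (where torch.compile was introduced) and a stand-in attention function rather than the authors' kernel.

```python
import math
import torch
import torch.nn.functional as F


def scaled_dot_attention(q, k, v):
    # Stand-in attention kernel; the authors compile their cone attention
    # the same way, but that kernel is not reproduced here.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v


# torch.compile is a PyTorch >= 2.0 API; on older versions this call fails,
# which is one reason pinned dependency versions matter for replication.
compiled_attention = torch.compile(scaled_dot_attention)

q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
print(compiled_attention(q, k, v).shape)    # first call triggers compilation
```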
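
The Open Datasets row points to publicly available benchmarks. As one hedged illustration of their accessibility, the graph datasets can be fetched with PyTorch Geometric; this is just one convenient loading route, not necessarily the pipeline used by the authors' GAT code.

```python
# Hedged sketch: downloading the public graph benchmarks named in the paper
# (Cora and PPI) via PyTorch Geometric. One possible route to the data only.
from torch_geometric.datasets import Planetoid, PPI

cora = Planetoid(root="data/Cora", name="Cora")     # transductive, one graph
ppi_train = PPI(root="data/PPI", split="train")     # inductive, multi-graph

print(cora[0])          # a single Data object with node features and edges
print(len(ppi_train))   # number of training graphs in PPI
```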