Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers
Authors: Markus Hiller, Krista A. Ehinger, Tom Drummond
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval but require 28% fewer FLOPs and are up to 8.4× faster. |
| Researcher Affiliation | Academia | Markus Hiller, Krista A. Ehinger, and Tom Drummond, School of Computing and Information Systems, The University of Melbourne, Australia; m.hiller@unimelb.edu.au |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | Yes | Code and models are publicly available at https://github.com/mrkshllr/BiXT. |
| Open Datasets | Yes | ImageNet1K [36], ModelNet40 [49], ADE20K [56], ShapeNetPart [52], LRA benchmark [40], Long ListOps [26], AAN [35]. |
| Dataset Splits | Yes | We pick the best model based on validation accuracy, and report the mean and (unbiased) standard deviation across these models evaluated on the withheld test set in Table A8. |
| Hardware Specification | Yes | Samples per second indicate empirical throughput at inference time for varying specified batch sizes bs (using one NVIDIA A100). We train our models using a single A100 GPU (80GB). |
| Software Dependencies | No | We implemented our models in PyTorch [30] using the timm library, and will release all code and pretrained models. We further made use of the mmsegmentation library [8] for the semantic segmentation experiments. Point cloud experiments were built on the publicly released code base from Ma et al. [25]. |
| Experiment Setup | Yes | Hyperparameter choice for the default ImageNet experiments: BiXT with 64 latents, 12 layers, embedding dimension of 192 for latents and tokens paired with 6 heads (head dimension 32); learning rate 2.5e-3, weight decay 0.05, and LAMB optimizer, as well as a cosine learning-rate scheduler with linear warmup; stochastic dropout of 0.1 on self-attention and cross-attention for all tiny models (see the training-setup sketch below the table). |
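
The tiny-model recipe reported in the Experiment Setup row can be expressed as a short training-setup sketch. This is an illustrative assumption, not the authors' released code: the backbone is a generic timm ViT-Tiny stand-in (the actual BiXT architecture uses 64 latents with bi-directional cross-attention), and the epoch and warmup counts are placeholders not stated in the excerpt. Only the learning rate, weight decay, LAMB optimizer, and cosine schedule with linear warmup come from the reported values.

```python
# Minimal sketch of the reported tiny-model training configuration, assuming
# timm's Lamb optimizer and CosineLRScheduler. The model is a placeholder
# stand-in from timm, NOT the authors' BiXT implementation (see their repo).
from timm import create_model
from timm.optim import Lamb
from timm.scheduler import CosineLRScheduler

# Placeholder backbone with the reported tiny-scale dimensions
# (embedding dim 192, 12 layers, 6 heads -> head dim 32).
model = create_model("vit_tiny_patch16_224", num_classes=1000)

# Reported hyperparameters: learning rate 2.5e-3, weight decay 0.05, LAMB.
optimizer = Lamb(model.parameters(), lr=2.5e-3, weight_decay=0.05)

# Cosine learning-rate schedule with linear warmup; the epoch and warmup
# lengths below are assumptions, not values given in the excerpt.
epochs, warmup_epochs = 300, 5
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=epochs,
    warmup_t=warmup_epochs,
    warmup_lr_init=1e-6,
)

# Skeleton training loop: step the scheduler once per epoch.
for epoch in range(epochs):
    # ... forward/backward passes over ImageNet-1K batches would go here ...
    scheduler.step(epoch + 1)
```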