Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Authors: Markus Hiller, Krista A. Ehinger, Tom Drummond

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval, but require 28% fewer FLOPs and are up to 8.4× faster.
Researcher Affiliation | Academia | Markus Hiller, Krista A. Ehinger, and Tom Drummond; School of Computing and Information Systems, The University of Melbourne, Australia; m.hiller@unimelb.edu.au
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | Yes | Code and models are publicly available at https://github.com/mrkshllr/BiXT.
Open Datasets | Yes | ImageNet-1K [36], ModelNet40 [49], ADE20K [56], ShapeNetPart [52], LRA benchmark [40], Long ListOps [26], AAN [35].
Dataset Splits | Yes | We pick the best model based on validation accuracy, and report the mean and (unbiased) standard deviation across these models evaluated on the withheld test set in Table A8.
Hardware Specification | Yes | Samples per second indicate empirical throughput at inference time for varying batch sizes bs as specified (using one NVIDIA A100). We train our models using a single A100 GPU (80 GB). (A throughput-measurement sketch follows the table.)
Software Dependencies | No | We implemented our models in PyTorch [30] using the timm library, and will release all code and pretrained models. We further made use of the mmsegmentation library [8] for the semantic segmentation experiments. Point cloud experiments were built on the publicly released code base from Ma et al. [25].
Experiment Setup | Yes | Hyperparameter choice for the default ImageNet experiments: BiXT with 64 latents, 12 layers, an embedding dimension of 192 for latents and tokens paired with 6 heads (head dimension 32), learning rate 2.5e-3, weight decay 0.05 with the LAMB optimizer, and a cosine learning-rate scheduler with linear warmup; stochastic dropout of 0.1 on self-attention and cross-attention for all tiny models. (A configuration sketch follows the table.)
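
The Experiment Setup row can be read as a small training configuration wired to standard timm/PyTorch components, which the paper states it builds on. The sketch below is only an illustration of how those numbers fit together under that assumption: the BiXT model itself is replaced by a placeholder module, and the warmup length, total epochs, and argument names are assumptions, not values quoted above or taken from the released code base.

```python
import torch
from torch import nn
from timm.optim import Lamb                    # LAMB optimizer shipped with timm
from timm.scheduler import CosineLRScheduler   # cosine schedule with linear warmup

# Default ImageNet "tiny" configuration as reported in the paper.
cfg = dict(
    num_latents=64,      # latent vectors
    depth=12,            # layers
    embed_dim=192,       # shared by latents and tokens
    num_heads=6,         # head dimension 192 / 6 = 32
    attn_drop=0.1,       # stochastic dropout on self- and cross-attention
    lr=2.5e-3,
    weight_decay=0.05,
    warmup_epochs=5,     # assumption: warmup length not given in the row above
    epochs=300,          # assumption: total epochs not given in the row above
)

# Placeholder standing in for the actual BiXT model (see the released repository).
model = nn.Linear(cfg["embed_dim"], 1000)

optimizer = Lamb(model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"])
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=cfg["epochs"],
    warmup_t=cfg["warmup_epochs"],
    warmup_lr_init=1e-6,
)

for epoch in range(cfg["epochs"]):
    scheduler.step(epoch)  # linear warmup for the first epochs, then cosine decay
    # ... one training epoch over ImageNet-1K would go here ...
```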
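The throughput figures referenced under Hardware Specification (samples per second at inference time on one A100 for a given batch size) can be reproduced with a generic timing loop. The following is a minimal sketch, not the authors' benchmarking code; the placeholder model, image resolution, and batch sizes are assumptions for illustration.

```python
import time

import timm
import torch


@torch.no_grad()
def measure_throughput(model, batch_size, img_size=224, n_warmup=10, n_iters=50):
    """Empirical samples/second at inference time for one batch size on a GPU."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")

    for _ in range(n_warmup):    # warm-up: kernel selection, memory allocation
        model(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()     # wait for all queued GPU work before stopping the clock
    elapsed = time.time() - start

    return n_iters * batch_size / elapsed


if __name__ == "__main__":
    # Placeholder model; swap in a BiXT checkpoint from the released repository.
    model = timm.create_model("resnet18", pretrained=False)
    for bs in (32, 256):
        print(f"bs={bs}: {measure_throughput(model, bs):,.0f} samples/s")
```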