Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Authors: Markus Hiller, Krista A. Ehinger, Tom Drummond

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval, but require 28% fewer FLOPs and are up to 8.4× faster.
Researcher Affiliation | Academia | Markus Hiller, Krista A. Ehinger, and Tom Drummond; School of Computing and Information Systems, The University of Melbourne, Australia; m.hiller@unimelb.edu.au
Pseudocode | No | No pseudocode or algorithm blocks found.
Open Source Code | Yes | Code and models are publicly available at https://github.com/mrkshllr/BiXT.
Open Datasets | Yes | ImageNet-1K [36], ModelNet40 [49], ADE20K [56], ShapeNetPart [52], LRA benchmark [40], Long ListOps [26], AAN [35].
Dataset Splits | Yes | We pick the best model based on validation accuracy, and report the mean and (unbiased) standard deviation across these models evaluated on the withheld test set in Table A8.
Hardware Specification | Yes | Samples per second indicate empirical throughput at inference time for varying batch sizes bs as specified (using one NVIDIA A100). We train our models using a single A100 GPU (80 GB). (A throughput-measurement sketch follows the table.)
Software Dependencies | No | We implemented our models in PyTorch [30] using the timm library, and will release all code and pretrained models. We further made use of the mmsegmentation library [8] for the semantic segmentation experiments. Point cloud experiments were built on the publicly released code base from Ma et al. [25].
Experiment Setup | Yes | Hyperparameter choice for the default ImageNet experiments: BiXT with 64 latents, 12 layers, an embedding dimension of 192 for latents and tokens paired with 6 heads (head dimension 32), learning rate 2.5e-3, weight decay 0.05 with the LAMB optimizer, and a cosine learning-rate scheduler with linear warmup; stochastic dropout of 0.1 on self-attention and cross-attention for all tiny models. (A configuration sketch follows the table.)
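
The Experiment Setup row can be read as a small training configuration wired to standard timm/PyTorch components, which the paper states it builds on. The sketch below is only an illustration of how those numbers fit together under that assumption: the BiXT model itself is replaced by a placeholder module, and the warmup length, total epochs, and argument names are assumptions, not values quoted above or taken from the released code base.

```python
import torch
from torch import nn
from timm.optim import Lamb                    # LAMB optimizer shipped with timm
from timm.scheduler import CosineLRScheduler   # cosine schedule with linear warmup

# Default ImageNet "tiny" configuration as reported in the paper.
cfg = dict(
    num_latents=64,      # latent vectors
    depth=12,            # layers
    embed_dim=192,       # shared by latents and tokens
    num_heads=6,         # head dimension 192 / 6 = 32
    attn_drop=0.1,       # stochastic dropout on self- and cross-attention
    lr=2.5e-3,
    weight_decay=0.05,
    warmup_epochs=5,     # assumption: warmup length not given in the row above
    epochs=300,          # assumption: total epochs not given in the row above
)

# Placeholder standing in for the actual BiXT model (see the released repository).
model = nn.Linear(cfg["embed_dim"], 1000)

optimizer = Lamb(model.parameters(), lr=cfg["lr"], weight_decay=cfg["weight_decay"])
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=cfg["epochs"],
    warmup_t=cfg["warmup_epochs"],
    warmup_lr_init=1e-6,
)

for epoch in range(cfg["epochs"]):
    scheduler.step(epoch)  # linear warmup for the first epochs, then cosine decay
    # ... one training epoch over ImageNet-1K would go here ...
```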
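The throughput figures referenced under Hardware Specification (samples per second at inference time on one A100 for a given batch size) can be reproduced with a generic timing loop. The following is a minimal sketch, not the authors' benchmarking code; the placeholder model, image resolution, and batch sizes are assumptions for illustration.

```python
import time

import timm
import torch


@torch.no_grad()
def measure_throughput(model, batch_size, img_size=224, n_warmup=10, n_iters=50):
    """Empirical samples/second at inference time for one batch size on a GPU."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, img_size, img_size, device="cuda")

    for _ in range(n_warmup):    # warm-up: kernel selection, memory allocation
        model(x)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()     # wait for all queued GPU work before stopping the clock
    elapsed = time.time() - start

    return n_iters * batch_size / elapsed


if __name__ == "__main__":
    # Placeholder model; swap in a BiXT checkpoint from the released repository.
    model = timm.create_model("resnet18", pretrained=False)
    for bs in (32, 256):
        print(f"bs={bs}: {measure_throughput(model, bs):,.0f} samples/s")
```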