Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers
Authors: Markus Hiller, Krista A. Ehinger, Tom Drummond
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval but require 28% fewer FLOPs and are up to 8.4× faster. |
| Researcher Affiliation | Academia | Markus Hiller, Krista A. Ehinger, and Tom Drummond, School of Computing and Information Systems, The University of Melbourne, Australia. EMAIL |
| Pseudocode | No | No pseudocode or algorithm blocks found. |
| Open Source Code | Yes | Code and models are publicly available at https://github.com/mrkshllr/BiXT. |
| Open Datasets | Yes | ImageNet-1K [36], ModelNet40 [49], ADE20K [56], ShapeNetPart [52], LRA benchmark [40], Long ListOps [26], AAN [35]. |
| Dataset Splits | Yes | We pick the best model based on validation accuracy, and report the mean and (unbiased) standard deviation across these models evaluated on the withheld test set in Table A8. |
| Hardware Specification | Yes | Samples per second indicate empirical throughput at inference time for varying specified batch sizes bs (using one NVIDIA A100). We train our models using a single A100 GPU (80 GB). |
| Software Dependencies | No | We implemented our models in PyTorch [30] using the timm library, and will release all code and pretrained models. We further made use of the mmsegmentation library [8] for the semantic segmentation experiments. Point cloud experiments were built on the publicly released code base from Ma et al. [25]. |
| Experiment Setup | Yes | Hyperparameter choice for the default ImageNet experiments: BiXT with 64 latents, 12 layers, embedding dimension 192 for latents and tokens paired with 6 heads (head dimension 32); learning rate 2.5e-3, weight decay 0.05 and LAMB optimizer, as well as a cosine learning rate scheduler with linear warmup; stochastic dropout of 0.1 on self-attention and cross-attention for all tiny models. |
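To make the quoted Experiment Setup row concrete, the hyperparameters can be collected into a small config sketch. This is a minimal illustration, not the authors' released code: the dict keys, the `cosine_lr_with_warmup` helper, and the warmup/step arguments are all assumptions; only the numeric values come from the table above.

```python
import math

# Hedged sketch of the reported default ImageNet hyperparameters for BiXT-Tiny.
# Key names are illustrative; values are taken from the Experiment Setup row.
CONFIG = {
    "num_latents": 64,
    "depth": 12,            # transformer layers
    "embed_dim": 192,       # shared by latents and tokens
    "num_heads": 6,         # head dimension = 192 / 6 = 32
    "lr": 2.5e-3,
    "weight_decay": 0.05,
    "optimizer": "lamb",    # LAMB, as reported in the table
    "attn_dropout": 0.1,    # stochastic dropout on self- and cross-attention
}

def cosine_lr_with_warmup(step, total_steps, warmup_steps, base_lr):
    """Linear warmup followed by cosine decay, matching the schedule
    described in the table (exact step counts are not reported here)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr_with_warmup(0, total, warmup, 2.5e-3)` returns 0, ramps linearly to the base rate at the end of warmup, then decays to 0 at `total_steps`.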