Window Attention is Bugged: How not to Interpolate Position Embeddings

Authors: Daniel Bolya, Chaitanya Ryali, Judy Hoffman, Christoph Feichtenhofer

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study two state-of-the-art methods that have these three components, namely Hiera and ViTDet, and find that both do indeed suffer from this bug. To fix it, we introduce a simple absolute window position embedding strategy, which solves the bug outright in Hiera and allows us to increase both speed and performance of the model in ViTDet. We finally combine the two to obtain HieraDet, which achieves 61.7 box mAP on COCO, making it state-of-the-art for models that only use ImageNet-1k pretraining.
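
To make the abstract's fix concrete, below is a minimal PyTorch sketch contrasting the bugged interpolation with an absolute window position embedding. The function names (`interpolate_global_pos_embed`, `absolute_window_pos_embed`), shapes, and tiling details are illustrative assumptions based on the abstract, not the authors' implementation; see their repository for the real code.

```python
import torch
import torch.nn.functional as F

# Bugged pattern: interpolate one global absolute position embedding when the
# finetuning resolution changes. The resized embedding no longer tiles evenly
# into attention windows, so each window sees a different, misaligned slice
# of positions.
def interpolate_global_pos_embed(pos_embed, new_size):
    # pos_embed: (1, C, H0, W0), learned at the pretraining resolution
    return F.interpolate(pos_embed, size=new_size, mode="bicubic")

# Absolute window embedding (sketch): keep a window-sized embedding that is
# tiled, so it stays aligned with window attention at any resolution, plus a
# coarse global embedding that is safe to interpolate.
def absolute_window_pos_embed(window_embed, global_embed, new_size):
    # window_embed: (1, C, Wh, Ww) -- one attention window, never interpolated
    # global_embed: (1, C, Gh, Gw) -- coarse global positions
    H, W = new_size
    Wh, Ww = window_embed.shape[-2:]
    # Assumes new_size is an exact multiple of the window size.
    tiled = window_embed.repeat(1, 1, H // Wh, W // Ww)
    global_up = F.interpolate(global_embed, size=new_size, mode="bicubic")
    return tiled + global_up
```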
Researcher Affiliation | Collaboration | Daniel Bolya (1,2), Chaitanya Ryali (2), Judy Hoffman (1), Christoph Feichtenhofer (2); 1 Georgia Tech, 2 FAIR, Meta; {dbolya,judy}@gatech.edu, {chayryali,feichtenhofer}@meta.com
Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found, nor were any structured code-like procedures presented.
Open Source Code | Yes | Work done during an internship at Meta. Code and models at https://github.com/facebookresearch/hiera.
Open Datasets | Yes | We evaluate our absolute win applied to ViTDet and HieraDet for object detection and instance segmentation on COCO (Lin et al., 2014) using Detectron2 (Wu et al., 2019), training on train2017 and testing on val2017.
Dataset Splits | Yes | We evaluate our absolute win applied to ViTDet and HieraDet for object detection and instance segmentation on COCO (Lin et al., 2014) using Detectron2 (Wu et al., 2019), training on train2017 and testing on val2017.
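
For reference, Detectron2 ships builtin registrations for exactly these COCO 2017 splits; the snippet below is a generic configuration sketch of that protocol, not the paper's full training recipe:

```python
from detectron2.config import get_cfg

# Detectron2's builtin dataset names for the standard COCO 2017 splits,
# matching the quoted protocol: train on train2017, evaluate on val2017.
cfg = get_cfg()
cfg.DATASETS.TRAIN = ("coco_2017_train",)
cfg.DATASETS.TEST = ("coco_2017_val",)
```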
Hardware Specification | Yes | We use a single NVIDIA A100 40GB GPU to benchmark speed for all baselines and our approach.
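
Single-GPU speed benchmarks of this kind are typically timed with CUDA events; the sketch below is a generic timing harness under that assumption, not the authors' exact benchmarking code:

```python
import torch

def benchmark(model, x, warmup=10, iters=50):
    """Time the average forward pass on one GPU using CUDA events.

    Generic sketch; the `warmup`/`iters` values are arbitrary choices,
    not taken from the paper.
    """
    model, x = model.cuda().eval(), x.cuda()
    with torch.no_grad():
        for _ in range(warmup):      # warm up kernels and the allocator
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()     # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters  # milliseconds per forward pass
```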
Software Dependencies | Yes | Also, for larger image sizes for H models, we had to use both activation checkpointing and torch 2.0 scaled dot product attention to allow the same batch size to fit in GPU memory.
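
Both quoted memory-saving measures map onto standard PyTorch 2.0 APIs; the sketch below shows how they are typically invoked (the `attention` and `checkpointed_block` wrappers are hypothetical names for illustration):

```python
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# PyTorch 2.0's fused attention kernel avoids materializing the full
# attention matrix in memory.
def attention(q, k, v):
    # q, k, v: (batch, heads, seq_len, head_dim)
    return F.scaled_dot_product_attention(q, k, v)

# Activation checkpointing trades compute for memory: block activations are
# recomputed during the backward pass instead of being stored.
def checkpointed_block(block, x):
    return checkpoint(block, x, use_reentrant=False)
```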
Experiment Setup | Yes | For finetuning, our hyperparameters follow Ryali et al. (2023) for the most part. We use these same hyperparameters for all image sizes in Tab. 6 and Tab. 8. Note that different hyperparameters might be optimal for higher resolution, but that is out of scope for our experiments here. We list the settings we use for finetuning in Tab. 13 below.