Quadtree Attention for Vision Transformers
Authors: Shitao Tang, Jiahui Zhang, Siyu Zhu, Ping Tan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our quadtree transformer on four representative tasks: feature matching, stereo, image classification, and object detection. The first two tasks require cross attention to fuse information across different images, while the latter two involve only self-attention. We implement our quadtree transformer using PyTorch and CUDA kernels. More implementation details are provided in Appendix B. |
| Researcher Affiliation | Collaboration | ¹Simon Fraser University, ²Alibaba A.I. Lab. shitaot@sfu.ca, zjhthu@gmail.com, siting.zsy@alibaba-inc.com, pingtan@sfu.ca |
| Pseudocode | No | The paper describes the QuadTree Attention mechanism and its two architectures (QuadTree-A and QuadTree-B) with equations and descriptive text, but it does not include a formal pseudocode block or algorithm listing. (A hedged sketch of the coarse-to-fine idea is given after this table.) |
| Open Source Code | Yes | The codes are available at https://github.com/Tangshitao/QuadtreeAttention. |
| Open Datasets | Yes | We experiment on ScanNet (Dai et al., 2017) with 1,513 scans. In order to accelerate training, we design the LoFTR-lite setting, which uses half of the feature channels of LoFTR and 453 training scans. We experiment on the Scene Flow FlyingThings3D (Mayer et al., 2016) synthetic dataset, which contains 25,466 images with a resolution of 960×540. We evaluate image classification on the ImageNet-1K dataset (Deng et al., 2009), which consists of 1.28M training images and 50K validation images from 1,000 categories. We experiment on the COCO dataset. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). |
| Dataset Splits | Yes | The ImageNet-1K dataset (Deng et al., 2009) consists of 1.28M training images and 50K validation images from 1,000 categories. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). (A split-loading sketch follows the table.) |
| Hardware Specification | No | The paper mentions training on '8 GPUs' for image classification but does not specify the GPU model or any other hardware components (CPU, RAM, etc.). |
| Software Dependencies | No | The paper mentions using 'PyTorch and CUDA kernels' but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | We train both LoFTR-lite and LoFTR for 30 epochs with batch size 8. For the quadtree transformer, we build pyramids of three levels with the coarsest resolution at 15×20 pixels. We set the parameter K to 8 at the finest level, and double it at coarser levels. We follow STTR to train the network, with 15 epochs of the AdamW optimizer. A OneCycle learning rate scheduler is used with a learning rate of 6e-4 and a batch size of 8. We crop and resize the input images to 224×224 pixels and train the model with a mini-batch of 128. All models are trained for 300 epochs from scratch on 8 GPUs. All the other training settings are the same as in (Wang et al., 2021c). We initialize the quadtree backbone with the weights pre-trained on ImageNet. We adopt the same setting as PVTv2, training the model with a batch size of 16 and the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ for 12 epochs. (An optimizer and scheduler sketch for the stereo recipe appears after this table.) |
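
Since the paper provides no algorithm listing, the following is a minimal two-level sketch of the coarse-to-fine top-K attention idea in plain PyTorch. It is not the authors' implementation: it omits QuadTree-A's weighting of fine attention scores by coarse scores and QuadTree-B's per-level message aggregation, uses two pyramid levels rather than the paper's three, and all names (`quadtree_attention_2level`, `topk`) are illustrative assumptions. The official CUDA-accelerated code is in the linked repository.

```python
import torch


def quadtree_attention_2level(q, k, v, topk=8):
    """Two-level coarse-to-fine top-K attention sketch.

    q, k, v: (B, C, H, W) feature maps with even H and W.
    Level 1 pools tokens 2x2 and runs full attention; level 2 lets each
    fine query attend only to the children of its top-K coarse keys.
    """
    B, C, H, W = q.shape
    Hc, Wc = H // 2, W // 2
    Nc = Hc * Wc                      # number of coarse cells; topk must be <= Nc
    scale = C ** -0.5

    def children(x):
        # (B, C, H, W) -> (B, Nc, 4, C): the four fine tokens in each 2x2 cell
        return (x.view(B, C, Hc, 2, Wc, 2)
                 .permute(0, 2, 4, 3, 5, 1)
                 .reshape(B, Nc, 4, C))

    q_f, k_f, v_f = children(q), children(k), children(v)
    q_c, k_c = q_f.mean(dim=2), k_f.mean(dim=2)           # coarse tokens (B, Nc, C)

    # Level 1: full attention over the (cheap) coarse grid.
    attn_c = torch.softmax(q_c @ k_c.transpose(-1, -2) * scale, dim=-1)
    idx = attn_c.topk(topk, dim=-1).indices               # (B, Nc, K) selected cells

    # Level 2: gather the 4 children of every selected coarse key cell.
    gidx = idx[..., None, None].expand(-1, -1, -1, 4, C)  # (B, Nc, K, 4, C)
    k_sel = k_f.unsqueeze(1).expand(-1, Nc, -1, -1, -1).gather(2, gidx)
    v_sel = v_f.unsqueeze(1).expand(-1, Nc, -1, -1, -1).gather(2, gidx)
    k_sel, v_sel = (t.reshape(B, Nc, topk * 4, C) for t in (k_sel, v_sel))

    # Fine attention restricted to the selected regions.
    attn_f = torch.softmax(q_f @ k_sel.transpose(-1, -2) * scale, dim=-1)
    out = attn_f @ v_sel                                  # (B, Nc, 4, C)

    # Fold the 2x2 children back into a (B, C, H, W) map.
    return (out.view(B, Hc, Wc, 2, 2, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))


x = torch.randn(2, 64, 16, 16)                  # toy self-attention example
y = quadtree_attention_2level(x, x, x, topk=8)  # y has shape (2, 64, 16, 16)
```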
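
The ImageNet-1K train/val split quoted above maps onto the standard torchvision API; the sketch below is a plausible loading setup, not taken from the paper's code. The dataset root path is a placeholder, and the validation transform is the conventional Resize/CenterCrop recipe, which the paper does not spell out. The COCO train2017/val2017 splits would be loaded analogously via `torchvision.datasets.CocoDetection`.

```python
from torchvision import datasets, transforms

# "/data/imagenet" is a placeholder path; the splits mirror the quoted
# protocol of 1.28M training images and 50K validation images.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),   # 224x224 crops, as in the paper
    transforms.ToTensor(),
])
val_tf = transforms.Compose([            # standard eval recipe (assumption)
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageNet("/data/imagenet", split="train", transform=train_tf)
val_set = datasets.ImageNet("/data/imagenet", split="val", transform=val_tf)
```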
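
The stereo recipe quoted in the experiment setup (AdamW, OneCycle schedule, learning rate 6e-4, 15 epochs, batch size 8) maps directly onto stock PyTorch. Below is a minimal sketch under stated assumptions: the model and `steps_per_epoch` are placeholders, not details from the paper.

```python
import torch

# Placeholder model and loop sizes: assumptions, not paper details.
model = torch.nn.Linear(256, 256)   # stands in for the STTR-style stereo network
steps_per_epoch = 1000              # depends on dataset size at batch size 8
epochs = 15                         # from the paper's stereo setting

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, epochs=epochs, steps_per_epoch=steps_per_epoch)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()      # loss.backward() would precede this in practice
        scheduler.step()      # OneCycle steps once per batch, not per epoch
```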