Quadtree Attention for Vision Transformers
Authors: Shitao Tang, Jiahui Zhang, Siyu Zhu, Ping Tan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment with our quadtree transformer on four representative tasks: feature matching, stereo, image classification, and object detection. The first two tasks require cross attention to fuse information across different images, while the latter two involve only self-attention. We implement our quadtree transformer using PyTorch and CUDA kernels. More implementation details are provided in Appendix B. |
| Researcher Affiliation | Collaboration | ¹Simon Fraser University, ²Alibaba A.I. Lab. shitaot@sfu.ca, zjhthu@gmail.com, siting.zsy@alibaba-inc.com, pingtan@sfu.ca |
| Pseudocode | No | The paper describes the QuadTree Attention mechanism and its two architectures (QuadTree-A and QuadTree-B) with equations and descriptive text, but it does not include a formal pseudocode block or algorithm listing. (A hedged sketch of the coarse-to-fine idea is given after this table.) |
| Open Source Code | Yes | The codes are available at https://github.com/Tangshitao/QuadtreeAttention. |
| Open Datasets | Yes | We experiment on ScanNet (Dai et al., 2017) with 1,513 scans. In order to accelerate training, we design the LoFTR-lite setting, which uses half of the feature channels of LoFTR and 453 training scans. We experiment on the Scene Flow FlyingThings3D (Mayer et al., 2016) synthetic dataset, which contains 25,466 images with a resolution of 960×540. We evaluate image classification on the ImageNet-1K dataset (Deng et al., 2009), which consists of 1.28M training images and 50K validation images from 1,000 categories. We experiment on the COCO dataset. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). |
| Dataset Splits | Yes | The ImageNet-1K dataset (Deng et al., 2009) consists of 1.28M training images and 50K validation images from 1,000 categories. All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). (A split-loading sketch follows the table.) |
| Hardware Specification | No | The paper mentions training on '8 GPUs' for image classification but does not specify the GPU model or any other hardware components (CPU, RAM, etc.). |
| Software Dependencies | No | The paper mentions using 'PyTorch and CUDA kernels' but does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | We train both LoFTR-lite and LoFTR for 30 epochs with batch size 8. For the quadtree transformer, we build pyramids of three levels with the coarsest resolution at 15×20 pixels. We set the parameter K to 8 at the finest level, and double it at coarser levels. We follow STTR to train the network, with 15 epochs of the AdamW optimizer. A OneCycle learning rate scheduler is used with a learning rate of 6e-4 and a batch size of 8. We crop and resize the input images to 224×224 pixels and train the model with a mini-batch of 128. All models are trained for 300 epochs from scratch on 8 GPUs. All the other training settings are the same as in (Wang et al., 2021c). We initialize the quadtree backbone with the weights pre-trained on ImageNet. We adopt the same setting as PVTv2, training the model with a batch size of 16 and the AdamW optimizer with an initial learning rate of 1 × 10⁻⁴ for 12 epochs. (An optimizer and scheduler sketch for the stereo recipe appears after this table.) |
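
Since the paper provides no algorithm listing, the following is a minimal two-level sketch of the coarse-to-fine top-K attention idea in plain PyTorch. It is not the authors' implementation: it omits QuadTree-A's weighting of fine attention scores by coarse scores and QuadTree-B's per-level message aggregation, uses two pyramid levels rather than the paper's three, and all names (`quadtree_attention_2level`, `topk`) are illustrative assumptions. The official CUDA-accelerated code is in the linked repository.

```python
import torch


def quadtree_attention_2level(q, k, v, topk=8):
    """Two-level coarse-to-fine top-K attention sketch.

    q, k, v: (B, C, H, W) feature maps with even H and W.
    Level 1 pools tokens 2x2 and runs full attention; level 2 lets each
    fine query attend only to the children of its top-K coarse keys.
    """
    B, C, H, W = q.shape
    Hc, Wc = H // 2, W // 2
    Nc = Hc * Wc                      # number of coarse cells; topk must be <= Nc
    scale = C ** -0.5

    def children(x):
        # (B, C, H, W) -> (B, Nc, 4, C): the four fine tokens in each 2x2 cell
        return (x.view(B, C, Hc, 2, Wc, 2)
                 .permute(0, 2, 4, 3, 5, 1)
                 .reshape(B, Nc, 4, C))

    q_f, k_f, v_f = children(q), children(k), children(v)
    q_c, k_c = q_f.mean(dim=2), k_f.mean(dim=2)           # coarse tokens (B, Nc, C)

    # Level 1: full attention over the (cheap) coarse grid.
    attn_c = torch.softmax(q_c @ k_c.transpose(-1, -2) * scale, dim=-1)
    idx = attn_c.topk(topk, dim=-1).indices               # (B, Nc, K) selected cells

    # Level 2: gather the 4 children of every selected coarse key cell.
    gidx = idx[..., None, None].expand(-1, -1, -1, 4, C)  # (B, Nc, K, 4, C)
    k_sel = k_f.unsqueeze(1).expand(-1, Nc, -1, -1, -1).gather(2, gidx)
    v_sel = v_f.unsqueeze(1).expand(-1, Nc, -1, -1, -1).gather(2, gidx)
    k_sel, v_sel = (t.reshape(B, Nc, topk * 4, C) for t in (k_sel, v_sel))

    # Fine attention restricted to the selected regions.
    attn_f = torch.softmax(q_f @ k_sel.transpose(-1, -2) * scale, dim=-1)
    out = attn_f @ v_sel                                  # (B, Nc, 4, C)

    # Fold the 2x2 children back into a (B, C, H, W) map.
    return (out.view(B, Hc, Wc, 2, 2, C)
               .permute(0, 5, 1, 3, 2, 4)
               .reshape(B, C, H, W))


x = torch.randn(2, 64, 16, 16)                  # toy self-attention example
y = quadtree_attention_2level(x, x, x, topk=8)  # y has shape (2, 64, 16, 16)
```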
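
The ImageNet-1K train/val split quoted above maps onto the standard torchvision API; the sketch below is a plausible loading setup, not taken from the paper's code. The dataset root path is a placeholder, and the validation transform is the conventional Resize/CenterCrop recipe, which the paper does not spell out. The COCO train2017/val2017 splits would be loaded analogously via `torchvision.datasets.CocoDetection`.

```python
from torchvision import datasets, transforms

# "/data/imagenet" is a placeholder path; the splits mirror the quoted
# protocol of 1.28M training images and 50K validation images.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),   # 224x224 crops, as in the paper
    transforms.ToTensor(),
])
val_tf = transforms.Compose([            # standard eval recipe (assumption)
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
train_set = datasets.ImageNet("/data/imagenet", split="train", transform=train_tf)
val_set = datasets.ImageNet("/data/imagenet", split="val", transform=val_tf)
```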
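
The stereo recipe quoted in the experiment setup (AdamW, OneCycle schedule, learning rate 6e-4, 15 epochs, batch size 8) maps directly onto stock PyTorch. Below is a minimal sketch under stated assumptions: the model and `steps_per_epoch` are placeholders, not details from the paper.

```python
import torch

# Placeholder model and loop sizes: assumptions, not paper details.
model = torch.nn.Linear(256, 256)   # stands in for the STTR-style stereo network
steps_per_epoch = 1000              # depends on dataset size at batch size 8
epochs = 15                         # from the paper's stereo setting

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, epochs=epochs, steps_per_epoch=steps_per_epoch)

for _ in range(epochs):
    for _ in range(steps_per_epoch):
        optimizer.step()      # loss.backward() would precede this in practice
        scheduler.step()      # OneCycle steps once per batch, not per epoch
```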