xT: Nested Tokenization for Larger Context in Large Images

Authors: Ritwik Gupta, Shufan Li, Tyler Zhu, Jitendra Malik, Trevor Darrell, Karttikeya Mangalam

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales, and assess our method's improvement on them. xT is a streaming, two-stage architecture that adapts existing vision backbones and long-sequence language models to effectively model large images without quadratic memory growth. We are able to increase accuracy by up to 8.6% on challenging classification tasks and F1 score by 11.6 on context-dependent segmentation on images as large as 29,000 × 29,000 pixels. (A sketch of this two-stage design appears below the table.)
Researcher Affiliation | Academia | ¹University of California, Berkeley; ²University of California, Los Angeles; ³Princeton University.
Pseudocode | No | The paper describes the architecture and method in detail but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pre-trained weights are available at https://github.com/bair-climate-initiative/xT.
Open Datasets | Yes | We focus on iNaturalist 2018 (Van Horn et al., 2018) for classification, xView3-SAR (Paolo et al., 2022) for segmentation, and Cityscapes (Cordts et al., 2016) for detection. Since iNaturalist 2018 is a massive dataset, we focus on the Reptilia super-class, the most challenging subset available in the benchmark (Van Horn et al., 2018).
Dataset Splits | No | The paper mentions 'report the validation numbers in Table 3' and describes training schedules, but it does not explicitly state the specific percentages or counts for training/validation/test splits.
Hardware Specification | Yes | For our comparisons, we use 40GB Nvidia A100 GPUs.
Software Dependencies | No | The paper mentions optimizers and learning rate schedules (e.g., 'AdamW optimizer (β1 = 0.9, β2 = 0.999, ε = 1 × 10⁻⁸) using cosine learning rate decay schedule'), but it does not specify software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x).
Experiment Setup | Yes | We train end-to-end on the Reptilia subset of iNaturalist 2018 for 100 epochs with the AdamW optimizer (β1 = 0.9, β2 = 0.999, ε = 1 × 10⁻⁸) and a cosine learning rate decay schedule. Swin-T, Swin-S, Hiera-B/+, and their xT variants use a base learning rate of 1 × 10⁻⁴, while Swin-B, Swin-L, and their xT variants use a base learning rate of 1 × 10⁻⁵. (A sketch of this optimizer configuration appears below.)
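
The Research Type row above describes xT as a streaming, two-stage architecture: a region-level tokenizer feeding a long-sequence context model. Below is a minimal PyTorch sketch of that idea only. The class names (xTSketch, the region/context encoder stubs), the tile size, and the vanilla TransformerEncoder standing in for the paper's long-sequence model are all illustrative assumptions, not the authors' implementation; see their repository above for the real one.

```python
# Minimal sketch of a two-stage "regions -> long-sequence context" pipeline.
# All module names and sizes are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class xTSketch(nn.Module):
    def __init__(self, region_size=256, embed_dim=768, num_classes=10):
        super().__init__()
        self.region_size = region_size
        # Stage 1 (stand-in): any vision backbone that maps one image region
        # to a single feature vector. Here: patchify conv + global pooling.
        self.region_encoder = nn.Sequential(
            nn.Conv2d(3, embed_dim, kernel_size=16, stride=16),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Stage 2 (stand-in): a sequence model over region tokens. The paper
        # uses long-context sequence models to avoid quadratic memory growth;
        # a vanilla TransformerEncoder is used here only for illustration.
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, image):
        # image: (B, 3, H, W) with H and W divisible by region_size
        b, c, h, w = image.shape
        r = self.region_size
        tokens = []
        # Stream over regions one at a time; at inference this keeps only a
        # single region's backbone activations alive at once.
        for i in range(0, h, r):
            for j in range(0, w, r):
                region = image[:, :, i:i + r, j:j + r]
                tokens.append(self.region_encoder(region))
        seq = torch.stack(tokens, dim=1)    # (B, num_regions, embed_dim)
        ctx = self.context_encoder(seq)     # fuse global context across regions
        return self.head(ctx.mean(dim=1))   # pooled classification logits

logits = xTSketch()(torch.randn(1, 3, 1024, 1024))  # 16 regions of 256x256
print(logits.shape)  # torch.Size([1, 10])
```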
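
The Experiment Setup row pins down concrete optimizer hyperparameters. A minimal sketch of that configuration using PyTorch's AdamW and CosineAnnealingLR follows; the placeholder model, the absence of warmup, and the default weight decay are assumptions, since the quoted setup does not specify them.

```python
import torch

model = torch.nn.Linear(768, 10)  # placeholder for a Swin/Hiera + xT model

# AdamW with the betas/eps quoted above. Base LR is 1e-4 for Swin-T/S and
# Hiera-B/+ variants; swap in lr=1e-5 for the Swin-B/L variants.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8)

# Cosine learning-rate decay over the quoted 100-epoch schedule.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training on the iNaturalist 2018 Reptilia subset ...
    scheduler.step()
```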