Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

Authors: Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

ICML 2023

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | We evaluate Hiera on a variety of tasks for image and video recognition. |
| Researcher Affiliation | Collaboration | Meta AI (FAIR), Georgia Tech, Johns Hopkins University. |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figures 2, 4, 5, 6) and descriptions of processes, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/hiera. (See the hub-loading sketch after the table.) |
| Open Datasets | Yes | ImageNet-1K (IN1K, Deng et al. (2009)) and Kinetics-400 (K400, Kay et al. (2017)). We ablate using our large model, Hiera-L, to ensure that our method works at scale. We evaluate performance by finetuning. All metrics are top-1 accuracies using standard evaluation protocols: a single (resized) center crop on IN1K and 3 spatial × 5 temporal views on K400. (See the evaluation-transform sketch after the table.) |
| Dataset Splits | Yes | We evaluate performance by finetuning. All metrics are top-1 accuracies using standard evaluation protocols: a single (resized) center crop on IN1K and 3 spatial × 5 temporal views on K400. For each ablation, we use 400 (800) epochs of sparse MAE pretraining for IN1K (K400) and 50 epochs of dense finetuning unless otherwise noted. |
| Hardware Specification | Yes | All benchmarks in this paper are on an A100 with fp16 (as this setting is most useful in practice) unless noted otherwise. We use an NVIDIA A100 40GB GPU, PyTorch v1.12.1, and CUDA 11.4 to benchmark speed for all baselines and our approach, unless otherwise mentioned. (See the benchmarking sketch after the table.) |
| Software Dependencies | Yes | We use an NVIDIA A100 40GB GPU, PyTorch v1.12.1, and CUDA 11.4 to benchmark speed for all baselines and our approach, unless otherwise mentioned. |
| Experiment Setup | Yes | Table 11: Settings for Kinetics-400, -600, -700. (a) Pretraining (e.g., 'optimizer AdamW', 'learning rate 8e-4', 'warmup epochs 120', 'epochs 800 / 1600 / 3200', 'batch size 512', 'mask ratio 0.9', 'drop path 0.1'). (b) Finetuning. Similar detailed settings are provided in Tables 12, 13, 14, and 15 for SSv2, AVA, ImageNet-1K, and COCO, respectively. (See the pretraining-schedule sketch after the table.) |
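
For the Open Source Code row, a minimal sketch of pulling a pretrained model from the released repository via torch.hub. The entrypoint name `hiera_base_224` and the `pretrained` keyword are assumptions based on common facebookresearch hub conventions, not verified against the repo's hubconf:

```python
import torch

# Hypothetical entrypoint name and keyword argument: "hiera_base_224" and
# pretrained=True follow typical facebookresearch torch.hub conventions but are
# assumptions here -- check the repository's README/hubconf for exact names.
model = torch.hub.load(
    "facebookresearch/hiera",
    "hiera_base_224",
    pretrained=True,
)
model.eval()

# Dummy forward pass at the ImageNet-1K finetuning resolution (224x224).
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)
```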
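
For the single (resized) center-crop IN1K evaluation quoted in the Open Datasets and Dataset Splits rows, a minimal torchvision sketch. The 256-to-224 resize/crop pair and the normalization constants are assumed standard defaults rather than values stated in the paper:

```python
from torchvision import transforms

# Single (resized) center-crop evaluation pipeline for IN1K.
# The 256 -> 224 resize/crop pair and the ImageNet mean/std are common
# conventions assumed here; the paper only states "a single (resized)
# center crop".
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```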
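
For the Hardware and Software rows, a rough sketch of how fp16 inference speed on a single A100 could be benchmarked with the reported PyTorch setup. The batch size and iteration counts are arbitrary assumptions:

```python
import torch

def images_per_second(model, batch_size=64, iters=50, warmup=10):
    """Rough fp16 throughput measurement on a single CUDA GPU (e.g., an A100)."""
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    with torch.no_grad(), torch.cuda.amp.autocast(dtype=torch.float16):
        for _ in range(warmup):   # warm up kernels before timing
            model(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()

    ms_per_iter = start.elapsed_time(end) / iters
    return batch_size / (ms_per_iter / 1000.0)
```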
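
For the Experiment Setup row, a sketch of the K400 MAE-pretraining optimizer and learning-rate schedule under the quoted numbers (AdamW, learning rate 8e-4, 120 warmup epochs, 800 epochs). Weight decay, betas, and the cosine-decay shape are assumptions not stated in the quoted settings:

```python
import math
import torch

# Values quoted in the Experiment Setup row (Table 11a, K400 pretraining).
base_lr = 8e-4
warmup_epochs = 120
total_epochs = 800          # 800 for K400; the row also lists 1600 / 3200

# Stand-in module; in practice this would be the sparse Hiera MAE model.
model = torch.nn.Linear(8, 8)

# Weight decay and betas are assumed MAE-style defaults, not quoted values.
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                              betas=(0.9, 0.95), weight_decay=0.05)

def lr_at_epoch(epoch: int) -> float:
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```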