Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Authors: Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Hiera on a variety of tasks for image and video recognition. |
| Researcher Affiliation | Collaboration | ¹Meta AI, FAIR; ²Georgia Tech; ³Johns Hopkins University. |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figures 2, 4, 5, 6) and descriptions of processes, but it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and models are available at https://github.com/facebookresearch/hiera. |
| Open Datasets | Yes | We ablate using our large model, Hiera-L, to ensure that our method works at scale. We evaluate performance by finetuning. All metrics are top-1 accuracies using standard evaluation protocols: a single (resized) center crop on IN1K and 3 spatial × 5 temporal views on K400. ImageNet-1K (IN1K, Deng et al. (2009)) and Kinetics-400 (K400, Kay et al. (2017)). |
| Dataset Splits | Yes | We evaluate performance by finetuning. All metrics are top-1 accuracies using standard evaluation protocols: a single (resized) center crop on IN1K and 3 spatial × 5 temporal views on K400 (a multi-view evaluation sketch follows the table). For each ablation, we use 400 (800) epochs of sparse MAE pretraining for IN1K (K400) and 50 epochs of dense finetuning unless otherwise noted. |
| Hardware Specification | Yes | All benchmarks in this paper are on an A100 with fp16 (as this setting is most useful in practice) unless noted otherwise. We use an NVIDIA A100 40GB GPU, PyTorch v1.12.1 and CUDA 11.4 to benchmark speed for all baselines and our approach, unless otherwise mentioned. |
| Software Dependencies | Yes | We use an NVIDIA A100 40GB GPU, PyTorch v1.12.1 and CUDA 11.4 to benchmark speed for all baselines and our approach, unless otherwise mentioned. (A timing sketch for this setting follows the table.) |
| Experiment Setup | Yes | Table 11: Settings for Kinetics-400, -600, -700. (a) Pretraining (e.g., 'optimizer AdamW', 'learning rate 8e-4', 'warmup epochs 120', 'epochs 800 / 1600 / 3200', 'batch size 512', 'mask ratio 0.9', 'drop path 0.1'). (b) Finetuning. Similar detailed settings are provided in Tables 12, 13, 14, and 15 for SSv2, AVA, ImageNet-1K, and COCO respectively. (A minimal pretraining-config sketch follows the table.) |
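
The 3 spatial × 5 temporal view protocol quoted in the 'Open Datasets' and 'Dataset Splits' rows amounts to running the classifier on every view of a video and averaging the predictions. Below is a minimal sketch of that averaging, with a cheap stand-in classifier where a finetuned Hiera checkpoint from the linked repository would go; logits are averaged here, though some codebases average softmax scores instead:

```python
import torch
import torch.nn as nn

# Cheap stand-in classifier: (views, C, T, H, W) clips -> 400 logits.
# A real evaluation would load a finetuned Hiera checkpoint instead.
model = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),  # pool (T, H, W) away -> (views, C, 1, 1, 1)
    nn.Flatten(),             # -> (views, C)
    nn.Linear(3, 400),        # 3 input channels -> 400 Kinetics classes
).eval()

@torch.no_grad()
def multi_view_predict(views: torch.Tensor) -> torch.Tensor:
    """Average logits over all views of one video.

    `views` has shape (num_views, C, T, H, W); the quoted K400 protocol
    uses 3 spatial x 5 temporal = 15 views per video.
    """
    return model(views).mean(dim=0)  # (400,)

# Example: 15 random views of a 16-frame 224x224 clip.
views = torch.randn(15, 3, 16, 224, 224)
top1 = multi_view_predict(views).argmax().item()
print(f"predicted class: {top1}")
```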
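
The hardware and software rows pin down the quoted benchmark setting (A100 40GB, fp16, PyTorch v1.12.1, CUDA 11.4). The paper does not show its benchmarking code; a plausible minimal sketch of how one might time a model in that setting, using CUDA events and autocast with a stand-in module:

```python
import torch

# Hedged sketch of per-iteration timing under the quoted setting
# (A100, fp16); this is an assumed harness, not the paper's code.
device = "cuda"
model = torch.nn.Linear(1024, 1024).to(device).eval()  # stand-in for Hiera
x = torch.randn(64, 1024, device=device)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    for _ in range(10):   # warmup iterations before timing
        model(x)
    torch.cuda.synchronize()
    start.record()
    for _ in range(100):  # timed iterations
        model(x)
    end.record()
    torch.cuda.synchronize()

print(f"{start.elapsed_time(end) / 100:.3f} ms/iter")
```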
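
Finally, the pretraining hyperparameters quoted from Table 11(a) map directly onto standard PyTorch pieces. A minimal sketch of that recipe, assuming a linear-warmup-plus-cosine learning-rate schedule (a common MAE convention; the quoted row lists only the warmup length) and a stand-in `nn.Linear` where the real Hiera MAE model from https://github.com/facebookresearch/hiera would go:

```python
import math
import torch
import torch.nn as nn

# Hyperparameters quoted from Table 11(a) (Kinetics pretraining).
EPOCHS = 800         # the paper also reports 1600- and 3200-epoch runs
WARMUP_EPOCHS = 120
BASE_LR = 8e-4
BATCH_SIZE = 512
MASK_RATIO = 0.9     # sparse MAE pretraining masks 90% of tokens
DROP_PATH = 0.1      # stochastic depth rate passed to the model

# Stand-in module; the real Hiera MAE model takes DROP_PATH and
# MASK_RATIO as construction/forward arguments.
model = nn.Linear(768, 768)

optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR)

def lr_at_epoch(epoch: int) -> float:
    """Linear warmup to BASE_LR, then cosine decay to zero.
    (Assumed schedule shape; the quoted row lists only the values above.)"""
    if epoch < WARMUP_EPOCHS:
        return BASE_LR * epoch / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# Per-epoch schedule update, as in typical MAE training loops.
for group in optimizer.param_groups:
    group["lr"] = lr_at_epoch(0)
```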