UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
Authors: Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of UViM on three diverse and challenging vision tasks: panoptic segmentation, depth prediction and image colorization, where we achieve competitive and near state-of-the-art results. Our experimental results suggest that UViM is a promising candidate for a unified modeling approach in computer vision. Quantitative results are presented in Table 1 and qualitative results are in Appendix A. |
| Researcher Affiliation | Industry | Alexander Kolesnikov André Susano Pinto Lucas Beyer Xiaohua Zhai Jeremiah Harmsen Neil Houlsby Google Research, Brain Team Zürich {akolesnikov,andresp,lbeyer,xzhai,jeremiah,neilhoulsby}@google.com |
| Pseudocode | No | The paper describes procedures but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The full UViM code is publicly available in the big_vision codebase: https://github.com/google-research/big_vision |
| Open Datasets | Yes | We train on the COCO panoptic 2017 [31, 23] dataset. It has approximately 118 000 training images and 5 000 official validation images which we use for test. All hyper-parameters were selected on 4 096 images held out from the training data. For training we use the ImageNet [39] training split consisting of 1.2M examples. We train on the NYU Depth V2 [43] dataset consisting of 47 584 training examples captured across 280 indoor scenes, and 654 official validation examples. |
| Dataset Splits | Yes | We train on the COCO panoptic 2017 [31, 23] dataset. It has approximately 118 000 training images and 5 000 official validation images which we use for test. All hyper-parameters were selected on 4 096 images held out from the training data. For hyper-parameter selection we hold out all examples from 14 scenes from the training set. |
| Hardware Specification | Yes | For all our experiments we use Google Cloud TPU-v3 hardware. A phase I training run for panoptic segmentation requires 1.9k TPU-v3 hours, while a phase II training run requires 0.9k TPU-v3 hours. |
| Software Dependencies | No | The paper mentions software components like Adafactor, Adam, and BERT-Base architecture but does not specify version numbers for any programming languages or libraries. |
| Experiment Setup | Yes | Experimental setup for stage I. We parameterize the base model f and the restricted oracle Ω with a ViT-B/16 model. For Ω we use 6 layers instead of 12, as in initial experiments we observed that a relatively small capacity is sufficient. Both models are trained from scratch. The input and output resolution during stage I for all tasks is 512×512. For optimization we use a variant of Adafactor [42] introduced in [56]. Due to differences in dataset size, we tune the learning rate and number of epochs per task, but all other hyperparameters are the same. For the guiding code, z ∈ Z, produced by the restricted oracle, we use a sequence length of 256 with 4096 dictionary entries. Experimental setup for stage II. The language model consists of an encoder and an autoregressive decoder. For the encoder, by default, we use the ViT-L/16 model. We initialize the encoder with the ImageNet-21k [39] pre-trained model from [46]. For the decoder, we use the ViT-B model. Note that there is no initial patch projection, as the decoder uses the guiding code z as autoregressive input; this is equivalent to the standard BERT-Base [8] architecture. As in stage I, the input and output resolution for all tasks is 512×512, except for the panoptic task, where we use a higher input resolution of 1280×1280. For optimization, we use the same optimizer as in stage I. For all tasks, we use a base learning rate of 0.001 with cosine decay and, additionally, apply a 10-fold reduction for the pre-trained encoder weights. Due to differences in dataset size, the number of epochs is tuned per task. (Illustrative configuration sketches for both stages are given below the table.) |
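
The stage I hyperparameters quoted above can be collected in a small configuration object. The sketch below is illustrative only: the key names and structure are assumptions and do not mirror the actual big_vision config schema; the values come from the paper's description (ViT-B/16 models, a 6-layer restricted oracle, 512×512 resolution, a 256-token guiding code over a 4096-entry dictionary, an Adafactor-style optimizer).

```python
# Illustrative stage I configuration assembled from the values quoted above.
# Key names are hypothetical and do not follow the big_vision config schema.
import ml_collections


def get_stage1_config():
    config = ml_collections.ConfigDict()

    # Base model f and restricted oracle are both ViT-B/16; the oracle uses
    # 6 transformer layers instead of the default 12.
    config.base_model = dict(variant="B/16", depth=12)
    config.oracle = dict(variant="B/16", depth=6)

    # Input/output resolution for all tasks in stage I.
    config.resolution = (512, 512)

    # Guiding code z: 256 discrete tokens over a 4096-entry dictionary.
    config.code_length = 256
    config.dict_size = 4096

    # Adafactor variant [42, 56]; learning rate and epochs are tuned per task,
    # so they are left as placeholders here.
    config.optimizer = "adafactor"
    config.lr = 0.0          # placeholder: tuned per task in the paper
    config.num_epochs = 0    # placeholder: tuned per task in the paper
    return config
```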
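
The stage II learning-rate scheme (base LR 0.001 with cosine decay, reduced 10-fold for the pre-trained encoder) can be expressed with per-parameter-group optimizers. The optax sketch below is an assumption-laden illustration, not the paper's implementation: standard `optax.adafactor` stands in for the modified Adafactor variant, the parameter-tree layout (a top-level `"encoder"` key) and the label function are hypothetical, and the step count is a placeholder since the paper tunes the number of epochs per task.

```python
# Sketch of the stage II learning-rate scheme: base LR 0.001 with cosine decay,
# applied with a 10x reduction to the pre-trained encoder weights.
import jax
import optax

TOTAL_STEPS = 100_000  # placeholder: the paper tunes the number of epochs per task


def make_stage2_optimizer(base_lr: float = 1e-3):
    # Base learning rate 0.001 decayed with a cosine schedule.
    schedule = optax.cosine_decay_schedule(init_value=base_lr,
                                           decay_steps=TOTAL_STEPS)

    # The decoder (trained from scratch) follows the full schedule; the
    # pre-trained ViT-L/16 encoder gets a 10-fold smaller learning rate.
    decoder_tx = optax.adafactor(learning_rate=schedule)
    encoder_tx = optax.adafactor(learning_rate=lambda step: 0.1 * schedule(step))

    def label_fn(params):
        # Hypothetical parameter layout: a nested dict whose top-level
        # "encoder" entry holds the pre-trained weights.
        return jax.tree_util.tree_map_with_path(
            lambda path, _: "encoder" if path[0].key == "encoder" else "decoder",
            params)

    return optax.multi_transform({"encoder": encoder_tx, "decoder": decoder_tx},
                                 label_fn)


# Usage (illustrative): params = {"encoder": {...}, "decoder": {...}}
# tx = make_stage2_optimizer(); opt_state = tx.init(params)
```

`optax.multi_transform` routes each labeled parameter group to its own optimizer, which reproduces the per-group learning-rate split described in the setup above.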