A Unified Sequence Interface for Vision Tasks
Authors: Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey E. Hinton
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the challenging COCO dataset, and show that it can simultaneously solve all four tasks well, without specialized architectures or loss functions. Our main results are summarized in Table 1, where we report baselines and two variants of our model: (1) single task models where the model is trained on a single task (still with the same architecture and objective function), so each task has its own network weights; and (2) a multi-task model, where a single set of network weights is used for all four tasks. |
| Researcher Affiliation | Industry | Ting Chen Saurabh Saxena Lala Li Tsung-Yi Lin David J. Fleet Geoffrey Hinton Google Research, Brain Team {iamtingchen,srbs,lala}@google.com. Work done at Google. |
| Pseudocode | Yes | Algorithm 1 Training based on data mixing. Algorithm 2 Training based on batch mixing. |
| Open Source Code | Yes | We will opensource our code at https://github.com/google-research/pix2seq. |
| Open Datasets | Yes | We evaluate the proposed method on the widely used MS-COCO 2017 dataset [26], containing 118k training images and 5k validation images, spanning the four tasks we consider. [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014. |
| Dataset Splits | Yes | We evaluate the proposed method on the widely used MS-COCO 2017 dataset [26], containing 118k training images and 5k validation images, spanning the four tasks we consider. |
| Hardware Specification | Yes | It's trained on 32-128 Cloud TPUs. Depending on architectures and tasks, training generally takes 4-12 hours. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For training on COCO, we use a batch size of 128 images, a learning rate of 1e-4, and we train the model for 100 epochs. We use a single vocabulary of 35K, with 32K text tokens, 1K coordinate quantization bins, and a few other class labels. We use a maximum sequence length of 512. Our backbone model is pretrained with 640×640 image size, and is fine-tuned in 640×640 or 1024×1024 resolutions. We use a mixed weighting of 0.1782, 0.7128, 0.099, 0.01 for object detection, instance segmentation, image captioning, and keypoint detection respectively. |
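The mixed task weighting quoted above (0.1782, 0.7128, 0.099, 0.01 over a batch of 128 images) can be illustrated with a small sketch in the spirit of the paper's "Training based on batch mixing" (Algorithm 2). The function name and the proportional floor-allocation scheme are assumptions made for illustration, not the authors' implementation.

```python
# Task mixing weights as quoted in the report: object detection, instance
# segmentation, image captioning, keypoint detection. The allocation scheme
# below is a hypothetical illustration, not the authors' code.
TASK_WEIGHTS = {
    "object_detection": 0.1782,
    "instance_segmentation": 0.7128,
    "image_captioning": 0.099,
    "keypoint_detection": 0.01,
}

def split_batch_by_task(batch_size, weights):
    """Allocate the slots of one mixed batch across tasks in proportion to
    their weights; leftover slots go to the heaviest tasks first."""
    total = sum(weights.values())
    counts = {t: int(batch_size * w / total) for t, w in weights.items()}
    leftover = batch_size - sum(counts.values())
    for task in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[task] += 1
    return counts

# Batch size of 128 images, as reported for COCO training.
counts = split_batch_by_task(128, TASK_WEIGHTS)
```

Under this scheme, each 128-image batch mixes examples from all four tasks, so one gradient step updates the shared weights for every task at once; the alternative quoted in the report (Algorithm 1, data mixing) would instead mix tasks at the example level before batching.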