A Unified Sequence Interface for Vision Tasks
Authors: Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin, David J. Fleet, Geoffrey E. Hinton
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on the challenging COCO dataset, and show that it can simultaneously solve all four tasks well, without specialized architectures or loss functions. Our main results are summarized in Table 1, where we report baselines and two variants of our model: (1) single task models where the model is trained on a single task (still with the same architecture and objective function), so each task has its own network weights; and (2) a multi-task model, where a single set of network weights is used for all four tasks. |
| Researcher Affiliation | Industry | Ting Chen Saurabh Saxena Lala Li Tsung-Yi Lin David J. Fleet Geoffrey Hinton Google Research, Brain Team {iamtingchen,srbs,lala}@google.com. Work done at Google. |
| Pseudocode | Yes | Algorithm 1 Training based on data mixing. Algorithm 2 Training based on batch mixing. |
| Open Source Code | Yes | We will opensource our code at https://github.com/google-research/pix2seq. |
| Open Datasets | Yes | We evaluate the proposed method on the widely used MS-COCO 2017 dataset [26], containing 118k training images and 5k validation images, spanning the four tasks we consider. [26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740-755. Springer, 2014. |
| Dataset Splits | Yes | We evaluate the proposed method on the widely used MS-COCO 2017 dataset [26], containing 118k training images and 5k validation images, spanning the four tasks we consider. |
| Hardware Specification | Yes | It's trained on 32-128 Cloud TPUs. Depending on architectures and tasks, training generally takes 4-12 hours. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | For training on COCO, we use a batch size of 128 images, a learning rate of 1e-4, and we train the model for 100 epochs. We use a single vocabulary of 35K, with 32K text tokens, 1K coordinate quantization bins, and a few other class labels. We use a maximum sequence length of 512. Our backbone model is pretrained with 640×640 image size, and is fine-tuned in 640×640 or 1024×1024 resolutions. We use a mixed weighting of 0.1782, 0.7128, 0.099, 0.01 for object detection, instance segmentation, image captioning, and keypoint detection respectively. |
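The mixed task weighting quoted above (0.1782, 0.7128, 0.099, 0.01 over a batch of 128 images) can be illustrated with a small sketch in the spirit of the paper's "Training based on batch mixing" (Algorithm 2). The function name and the proportional floor-allocation scheme are assumptions made for illustration, not the authors' implementation.

```python
# Task mixing weights as quoted in the report: object detection, instance
# segmentation, image captioning, keypoint detection. The allocation scheme
# below is a hypothetical illustration, not the authors' code.
TASK_WEIGHTS = {
    "object_detection": 0.1782,
    "instance_segmentation": 0.7128,
    "image_captioning": 0.099,
    "keypoint_detection": 0.01,
}

def split_batch_by_task(batch_size, weights):
    """Allocate the slots of one mixed batch across tasks in proportion to
    their weights; leftover slots go to the heaviest tasks first."""
    total = sum(weights.values())
    counts = {t: int(batch_size * w / total) for t, w in weights.items()}
    leftover = batch_size - sum(counts.values())
    for task in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[task] += 1
    return counts

# Batch size of 128 images, as reported for COCO training.
counts = split_batch_by_task(128, TASK_WEIGHTS)
```

Under this scheme, each 128-image batch mixes examples from all four tasks, so one gradient step updates the shared weights for every task at once; the alternative quoted in the report (Algorithm 1, data mixing) would instead mix tasks at the example level before batching.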