Pix2seq: A Language Modeling Framework for Object Detection

Authors: Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images. To compare with DETR and Faster R-CNN, we report average precision (AP), an integral metric over multiple thresholds, on the validation set at the last training epoch.
Researcher Affiliation | Industry | Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton; Google Research, Brain Team. Correspondence to: iamtingchen@google.com
Pseudocode | Yes | Algorithm 1: Quantization of (normalized) coordinates (see the sketch after this table).
Open Source Code | Yes | Code and checkpoints are available at https://github.com/google-research/pix2seq.
Open Datasets | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014)...
Dataset Splits | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud compute instances) used for training or running the experiments.
Software Dependencies | No | The paper mentions common deep learning components such as ResNet and Transformer architectures but does not specify version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | For training from scratch, we follow (Carion et al., 2020) using a ResNet backbone (He et al., 2016), followed by 6 layers of transformer encoder and 6 layers of (causal) transformer decoder (Vaswani et al., 2017)... We resize images (with a fixed aspect ratio) so the longer side is 1333 pixels. For sequence construction, we use 2000 quantization bins... The model is trained for 300 epochs with a batch size of 128... We use the AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2018) with a learning rate of 0.003 and weight decay of 0.05. (See the configuration sketch after this table.)
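The Algorithm 1 quoted in the Pseudocode row maps normalized box coordinates to a fixed set of discrete bins so they can be emitted as sequence tokens. Below is a minimal sketch of that mapping, assuming coordinates normalized to [0, 1] and the 2000 bins quoted in the experiment setup; the exact rounding and clipping in the released code may differ.

```python
def quantize(coord: float, num_bins: int = 2000) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete bin index.

    Minimal sketch of the paper's coordinate quantization (Algorithm 1);
    details of the official implementation may differ.
    """
    coord = min(max(coord, 0.0), 1.0)          # clip to the valid range
    return int(round(coord * (num_bins - 1)))  # bin index in 0 .. num_bins - 1


def dequantize(token: int, num_bins: int = 2000) -> float:
    """Inverse mapping from a bin index back to a normalized coordinate."""
    return token / (num_bins - 1)
```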
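The Experiment Setup excerpt fixes the main training hyperparameters for the from-scratch setting. The sketch below simply collects those quoted values into one Python configuration object; the class and field names are illustrative and do not correspond to the released pix2seq configuration files.

```python
from dataclasses import dataclass


@dataclass
class TrainFromScratchConfig:
    """Hyperparameters quoted in the paper's experiment setup (illustrative names)."""
    backbone: str = "resnet"        # ResNet backbone (He et al., 2016)
    encoder_layers: int = 6         # transformer encoder layers
    decoder_layers: int = 6         # causal transformer decoder layers
    max_image_side: int = 1333      # longer side after aspect-preserving resize
    quantization_bins: int = 2000   # coordinate bins for sequence construction
    epochs: int = 300
    batch_size: int = 128
    optimizer: str = "adamw"
    learning_rate: float = 0.003
    weight_decay: float = 0.05
```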