Pix2seq: A Language Modeling Framework for Object Detection
Authors: Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images. To compare with DETR and Faster R-CNN, we report average precision (AP), an integral metric over multiple thresholds, on validation set at the last training epoch. |
| Researcher Affiliation | Industry | Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton Google Research, Brain Team Correspondence to: iamtingchen@google.com |
| Pseudocode | Yes | Algorithm 1 Quantization of (normalized) coordinates |
| Open Source Code | Yes | Code and checkpoints available at https://github.com/google-research/pix2seq. |
| Open Datasets | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014)... |
| Dataset Splits | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud compute instances) used for training or running experiments. |
| Software Dependencies | No | The paper mentions common deep learning components like ResNet and Transformer architectures but does not specify version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For training from scratch, we follow (Carion et al., 2020) using a ResNet backbone (He et al., 2016), followed by 6 layers of transformer encoder and 6 layers of (causal) transformer decoder (Vaswani et al., 2017)... We resize images (with a fixed aspect ratio) so the longer side is 1333 pixels. For sequence construction, we use 2000 quantization bins... The model is trained for 300 epochs with a batch size of 128... We use AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2018) with a learning rate of 0.003 and weight decay of 0.05. |