Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Pix2seq: A Language Modeling Framework for Object Detection
Authors: Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images. To compare with DETR and Faster R-CNN, we report average precision (AP), an integral metric over multiple thresholds, on validation set at the last training epoch. |
| Researcher Affiliation | Industry | Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, Geoffrey Hinton Google Research, Brain Team Correspondence to: EMAIL |
| Pseudocode | Yes | Algorithm 1 Quantization of (normalized) coordinates |
| Open Source Code | Yes | Code and checkpoints available at https://github.com/google-research/pix2seq. |
| Open Datasets | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014)... |
| Dataset Splits | Yes | We evaluate the proposed method on the MS-COCO 2017 detection dataset (Lin et al., 2014), containing 118k training images and 5k validation images. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU types, or cloud compute instances) used for training or running experiments. |
| Software Dependencies | No | The paper mentions common deep learning components like ResNet and Transformer architectures but does not specify version numbers for any software, libraries, or frameworks used in the experiments (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For training from scratch, we follow (Carion et al., 2020) using a ResNet backbone (He et al., 2016), followed by 6 layers of transformer encoder and 6 layers of (causal) transformer decoder (Vaswani et al., 2017)... We resize images (with a fixed aspect ratio) so the longer side is 1333 pixels. For sequence construction, we use 2000 quantization bins... The model is trained for 300 epochs with a batch size of 128... We use AdamW optimizer (Kingma & Ba, 2014; Loshchilov & Hutter, 2018) with a learning rate of 0.003 and weight decay of 0.05. |
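The table above cites the paper's Algorithm 1 (quantization of normalized coordinates) and the use of 2000 quantization bins for sequence construction. A minimal sketch of what such coordinate quantization could look like is shown below; the function names and the exact rounding behavior are assumptions for illustration, not the paper's verbatim algorithm.

```python
# Sketch of quantizing normalized box coordinates into discrete tokens,
# assuming coordinates in [0, 1] and a fixed number of bins (2000 in the paper).
# Function names and rounding details are illustrative assumptions.

def quantize(coord: float, bins: int = 2000) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    return min(int(coord * bins), bins - 1)

def dequantize(token: int, bins: int = 2000) -> float:
    """Map a bin index back to an approximate normalized coordinate (bin center)."""
    return (token + 0.5) / bins

# Example: a bounding box (ymin, xmin, ymax, xmax) in normalized coordinates
box = (0.10, 0.25, 0.60, 0.90)
tokens = [quantize(c) for c in box]
recovered = [dequantize(t) for t in tokens]
```

With 2000 bins, the maximum quantization error per coordinate is half a bin width (0.00025 in normalized units), which is why a modest vocabulary suffices for detection-quality localization.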