Rethinking Pre-training and Self-training

Authors: Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin Dogus Cubuk, Quoc Le

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our work studies self-training with a focus on answering the above question. We define a set of control experiments where we use ImageNet as additional data with the goal of improving COCO. We vary the amount of labeled data in COCO and the strength of data augmentation as control factors. Our experiments show that as we increase the strength of data augmentation or the amount of labeled data, the value of pre-training diminishes.
Researcher Affiliation | Industry | Google Research, Brain Team {barretzoph,golnazg,tsungyi,yincui,hanxiaol,cubuk,qvl}@google.com
Pseudocode | No | The paper describes methods in textual form but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and checkpoints for our models are available at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/self_training
Open Datasets | Yes | We use COCO dataset [58] (118k images) for supervised learning. In self-training, we experiment with ImageNet [59] (1.2M images) and Open Images [60] (1.7M images) as unlabeled datasets. We use the train set (1.5k images) of PASCAL VOC 2012 segmentation dataset [64] for supervised learning.
Dataset Splits | Yes | For all experiments using different augmentation strengths and dataset sizes, we allow each model to train until it converges (when training longer stops helping or even hurts performance on a held-out validation set). Eff-B7 models (Eff) are trained on the PASCAL train set for validation results and on train+val for test results.
Hardware Specification | No | The paper does not explicitly provide hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments. It mentions 'tpu' in the GitHub link, but this is part of the repository path, not a statement about the hardware used in their experiments.
Software Dependencies | No | The paper mentions TensorFlow implicitly through the GitHub link, but it does not specify version numbers for any software, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | The training batch size is 256 with weight decay 1e-4. The model is trained with learning rate 0.32 and a cosine learning rate decay schedule [62]. At the beginning of training, the learning rate is linearly increased over the first 1000 steps from 0.0032 to 0.32. For semantic segmentation: the learning rate is set to 0.08 for EfficientNet-B7 and 0.2 for EfficientNet-L2, with batch size 256 and weight decay 1e-5.