Time-Consistent Self-Supervision for Semi-Supervised Learning

Authors: Tianyi Zhou, Shengjie Wang, Jeff Bilmes

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments, we show that TC-SSL outperforms the very recent MixMatch and other SSL approaches on three datasets (CIFAR10, CIFAR100, and STL10) under various labeled-unlabeled splits and significantly improves SSL efficiency, i.e., it consistently uses < 20% of the training batches that the best baseline needs.
Researcher Affiliation | Academia | University of Washington, Seattle. Correspondence to: Tianyi Zhou <tianyizh@uw.edu>, Shengjie Wang <wangsj@uw.edu>, Jeff A. Bilmes <bilmes@uw.edu>.
Pseudocode | Yes | We provide the complete description of TC-SSL in Algorithm 1.
Open Source Code | No | No explicit statement about releasing source code for their method or a link to a code repository was found.
Open Datasets | Yes | CIFAR10, CIFAR100 (Krizhevsky & Hinton, 2009), and STL10 (Coates et al., 2011).
Dataset Splits | Yes | For CIFAR10 experiments, we train a small WideResNet-28-2 (28 layers, width factor of 2, 1.5 million parameters) and a large WideResNet-28-135 (28 layers, 135 filters per layer, 26 million parameters) for four kinds of labeled/unlabeled/validation random splits applied to the original training set of CIFAR10, i.e., 500/44500/5000, 1000/44000/5000, 2000/43000/5000, and 4000/41000/5000 (a minimal sketch of such a split appears after this table).
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided.
Software Dependencies | No | Only 'PyTorch' is mentioned as a software dependency, without a specific version number.
Experiment Setup | Yes | For TC-SSL in the experiments, we apply T0 = 10 warm-starting epochs and T = 680 epochs in total. Note that an epoch here refers to one iteration of Algorithm 1 and differs from its meaning in most fully supervised training, where it refers to a full pass over the whole training set; in our case, the training samples in each epoch change according to our curriculum of kt. We apply SGD with momentum of 0.9 and weight decay of 2 × 10⁻⁵, and use a modified cosine-annealing learning-rate schedule (Loshchilov & Hutter, 2017) over multiple episodes of increasing length and decaying target learning rate, since it can quickly jump between different local minima on the loss landscape and explore more regions without being trapped in a bad local minimum. In particular, we set up 12 episodes, with epochs-per-episode starting from 10 (i.e., the warm-starting episode) and increasing by 10 after every episode until reaching epoch 680. The learning rates at the beginning and end of the first episode are set to 0.2 and 0.02, respectively; we then multiply each of them by 0.9 after every episode. We do not heavily tune the λ-parameters and γ-parameters: for all experiments, we use λcs = 20/C, λct = 0.2, λce = 1.0, γθ = γc = 0.99, and γk = 0.005, where C is the number of classes. For data augmentation, we use AutoAugment (Cubuk et al., 2019a) learned policies for the three datasets, followed by MixUp with the mixing weight sampled from Beta(0.5, 0.5). We initialize k1 = 0.1|U| and θ0 with the PyTorch default initialization. We apply all the practical tips detailed in Section 3.3. (Hedged sketches of the learning-rate schedule and the MixUp step appear after this table.)
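
The dataset splits above are reported as counts only; the paper excerpt does not say whether they are drawn class-balanced or uniformly at random. The following is a minimal sketch of the 4000/41000/5000 case, assuming a uniform random split of the 50,000-image CIFAR10 training set via torchvision's loader; the helper name random_split_indices is hypothetical, not the authors' released script.

```python
# Assumed reconstruction of a labeled/unlabeled/validation split of the
# CIFAR10 training set (here 4000/41000/5000); a sketch, not the paper's code.
import numpy as np
from torchvision import datasets

def random_split_indices(num_samples, num_labeled, num_val, seed=0):
    """Return disjoint index arrays (labeled, unlabeled, validation)."""
    rng = np.random.RandomState(seed)
    perm = rng.permutation(num_samples)
    labeled = perm[:num_labeled]
    val = perm[num_labeled:num_labeled + num_val]
    unlabeled = perm[num_labeled + num_val:]
    return labeled, unlabeled, val

train_set = datasets.CIFAR10(root="./data", train=True, download=True)
labeled_idx, unlabeled_idx, val_idx = random_split_indices(
    len(train_set), num_labeled=4000, num_val=5000)
print(len(labeled_idx), len(unlabeled_idx), len(val_idx))  # 4000 41000 5000
```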
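
The multi-episode schedule in the setup row can be read as: within each episode the learning rate is cosine-annealed from a start value to a target value, and both endpoints are multiplied by 0.9 when a new episode begins, with episode lengths 10, 20, 30, ... epochs. Below is a minimal sketch under that reading; the function name is hypothetical, and the exact episode lengths are an assumption, since the quoted totals (12 episodes, 680 epochs) do not pin down the last episode.

```python
# Hedged sketch of a multi-episode cosine-annealing learning-rate schedule
# (in the style of Loshchilov & Hutter, 2017): first-episode lr goes from
# 0.2 to 0.02, and both endpoints are multiplied by 0.9 after every episode.
# Episode lengths are an assumption based on the quoted description.
import math

def tc_ssl_learning_rate(epoch, lr_start=0.2, lr_end=0.02, decay=0.9,
                         episode_lengths=tuple(10 * (i + 1) for i in range(12))):
    """Return the learning rate for a 0-indexed training epoch."""
    start, end = lr_start, lr_end
    episode_begin = 0
    for length in episode_lengths:
        if epoch < episode_begin + length:
            # Cosine annealing from `start` down to `end` within this episode.
            progress = (epoch - episode_begin) / max(length - 1, 1)
            return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))
        episode_begin += length
        start *= decay  # both endpoints decay after every episode
        end *= decay
    return end  # after the last episode, stay at the final target rate

print(tc_ssl_learning_rate(0))  # 0.2 (start of the warm-starting episode)
print(tc_ssl_learning_rate(9))  # 0.02 (end of the first 10-epoch episode)
```

In a PyTorch training loop, this value could simply be written into the param groups of an SGD optimizer configured with momentum 0.9 and weight decay 2 × 10⁻⁵ at the start of each epoch.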
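
The augmentation pipeline also applies MixUp with a Beta(0.5, 0.5) mixing weight after the AutoAugment policies. A minimal batch-level MixUp sketch follows; the function name, the one-hot-target convention, and mixing each batch with a permuted copy of itself are assumptions rather than the paper's exact implementation.

```python
# Minimal MixUp sketch with the mixing weight sampled from Beta(0.5, 0.5).
# Mixing a batch with a randomly permuted copy of itself is one common
# convention; the paper's exact pairing strategy is not specified here.
import torch

def mixup_batch(inputs, targets_onehot, alpha=0.5):
    """Mix inputs and (one-hot or soft) targets with a Beta(alpha, alpha) weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(inputs.size(0))
    mixed_inputs = lam * inputs + (1.0 - lam) * inputs[index]
    mixed_targets = lam * targets_onehot + (1.0 - lam) * targets_onehot[index]
    return mixed_inputs, mixed_targets
```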