There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

Authors: Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

ICLR 2019

Reproducibility Variable Result LLM Response
Research Type Experimental In a thorough empirical study we show that this procedure achieves the best known semi-supervised results on consequential benchmarks. In particular:

We show in Section 3.1 that a simplified Π model implicitly regularizes the norm of the Jacobian of the network outputs with respect to both its inputs and its weights, which in turn encourages flatter solutions. Both the reduced Jacobian norm and flatness of solutions have been related to generalization in the literature (Sokolić et al., 2017; Novak et al., 2018; Chaudhari et al., 2016; Hochreiter and Schmidhuber, 1997; Keskar et al., 2017; Izmailov et al., 2018). Interpolating between the weights corresponding to different epochs of training, we demonstrate that the solutions of the Π and Mean Teacher models are indeed flatter along these directions (Figure 1b).

In Section 3.2, we compare the training trajectories of the Π, Mean Teacher, and supervised models and find that the distances between the weights corresponding to different epochs are much larger for the consistency-based models. The error curves of consistency models are also wider (Figure 1b), which can be explained by the flatness of the solutions discussed in Section 3.1. Further, we observe that the predictions of the SGD iterates can differ significantly between different iterations of SGD. We observe that for consistency-based methods, SGD does not converge to a single point but continues to explore many solutions that lie far apart. Inspired by this observation, we propose to average the weights corresponding to SGD iterates, or to ensemble the predictions of the models corresponding to these weights. Averaging the weights of SGD iterates compensates for larger steps, stabilizes the SGD trajectory, and yields a solution centered in a flat region of the loss (as a function of the weights). Further, we show that the SGD iterates correspond to models with diverse predictions; using weight averaging or ensembling allows us to exploit this diversity and obtain a better solution than the individual SGD iterates.

In Section 3.3 we demonstrate that both ensembling predictions and averaging weights of the networks corresponding to different training epochs significantly improve generalization performance, and we find that the improvement is much larger for the Π and Mean Teacher models compared to supervised training. We find that averaging weights provides similar or improved accuracy compared to ensembling, while offering the computational benefits and convenience of working with a single model. Thus, we focus on weight averaging for the remainder of the paper.

Motivated by our observations in Section 3, we propose to apply Stochastic Weight Averaging (SWA) (Izmailov et al., 2018) to the Π and Mean Teacher models. Based on our results in Section 3.3, we propose several modifications to SWA in Section 4. In particular, we propose fast-SWA, which (1) uses a learning rate schedule with longer cycles to increase the distance between the weights that are averaged and the diversity of the corresponding predictions; and (2) averages the weights of multiple points within each cycle (while SWA only averages weights corresponding to the lowest values of the learning rate within each cycle). In Section 5, we show that fast-SWA converges to a good solution much faster than SWA.

Applying weight averaging to the Π and Mean Teacher models, we improve the best reported results on CIFAR-10 for 1k, 2k, 4k and 10k labeled examples, as well as on CIFAR-100 with 10k labeled examples. For example, we obtain 5.0% error on CIFAR-10 with only 4k labels, improving the best result reported in the literature (Tarvainen and Valpola, 2017) by 1.3%. We also apply weight averaging to a state-of-the-art domain adaptation technique (French et al., 2018) closely related to the Mean Teacher model and improve the best reported results on domain adaptation from CIFAR-10 to STL from 19.9% to 16.8% error.
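For readers reproducing this step, the weight-averaging procedure described in the response above can be sketched as follows. This is a minimal illustration using PyTorch's stock AveragedModel utility, not the authors' released implementation (linked under Open Source Code below); the model, the epoch counts, and the k = 3 spacing are illustrative placeholders.

```python
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

# Toy network standing in for the Pi / Mean Teacher student network.
model = nn.Linear(10, 2)

# AveragedModel keeps an equal-weight running average of every set of
# parameters passed to update_parameters(), which is the core of (fast-)SWA.
swa_model = AveragedModel(model)

swa_start, k, num_epochs = 150, 3, 180   # illustrative values
for epoch in range(num_epochs):
    # ... one epoch of consistency-regularized training would run here ...
    if epoch >= swa_start and (epoch - swa_start) % k == 0:
        swa_model.update_parameters(model)   # fold current weights into the average

# Before evaluating swa_model, batch-norm statistics should be recomputed for
# the averaged weights, e.g. with torch.optim.swa_utils.update_bn(loader, swa_model).
```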
Researcher Affiliation Academia Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson {pa338, maf388, pi49, andrew}@cornell.edu Cornell University
Pseudocode No The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code Yes We release our code at https://github.com/benathi/fastswa-semi-sup
Open Datasets Yes We evaluate the proposed fast-SWA method using the Π and MT models on the CIFAR-10 dataset (Krizhevsky). We also analyze the effect of using the Tiny Images dataset (Torralba et al., 2008) as an additional source of unlabeled data.
Dataset Splits Yes To determine the initial learning rate η0 and the cycle length c we used a separate validation set of size 5000 taken from the unlabeled data.
Hardware Specification No The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance specifications).
Software Dependencies No The paper mentions 'PyTorch' and 'SGD instead of Adam' (referring to optimizers), but does not provide specific version numbers for any software dependencies.
Experiment Setup Yes We consider two different schedules. In the short schedule we set the cosine half-period ℓ0 = 210 and the training length ℓ = 180, following the schedule used in Tarvainen and Valpola (2017) in Shake-Shake experiments. For our Shake-Shake experiments we also report results with the long schedule, where we set ℓ = 1800, ℓ0 = 1500 following Gastaldi (2017). To determine the initial learning rate η0 and the cycle length c we used a separate validation set of size 5000 taken from the unlabeled data. For the short schedule we use cycle length c = 30 and average models once every k = 3 epochs. For the long schedule we use c = 200, k = 20. In all experiments we use the stochastic gradient descent optimizer with Nesterov momentum (Loshchilov and Hutter, 2016). In fast-SWA we average the weights of the models corresponding to every third epoch.

In the Π model, we back-propagate the gradients through the student side only (as opposed to both sides in (Laine and Aila, 2016)). For Mean Teacher we use a decay rate of α = 0.97 in the Exponential Moving Average (EMA) of the student's weights. For all other hyper-parameters we reuse the values from Tarvainen and Valpola (2017) unless mentioned otherwise. As in Tarvainen and Valpola (2017), we use the squared L2 norm ‖·‖² as the divergence in the consistency loss. Similarly, we ramp up the consistency cost λ over the first 5 epochs from 0 up to its maximum value of 100, as done in Tarvainen and Valpola (2017). We use cosine annealing learning rates with no learning rate ramp up, unlike in the original MT implementation (Tarvainen and Valpola, 2017).

CIFAR-10 CNN Experiments: We use a total batch size of 100 for CNN experiments with a labeled batch size of 50. We use the maximum learning rate η0 = 0.1.

CIFAR-10 ResNet + Shake-Shake: We use a total batch size of 128 for ResNet experiments with a labeled batch size of 31. We use the maximum learning rate η0 = 0.05 for CIFAR-10. This applies for both the short and long schedules.

CIFAR-100 CNN Experiments: We use a total batch size of 128 with a labeled batch size of 31 for the 10k and 50k label settings. For the 50k+500k and 50k+237k settings, we use a labeled batch size of 64. We also limit the number of unlabeled images used in each epoch to 100k images. We use the maximum learning rate η0 = 0.1.

CIFAR-100 ResNet + Shake-Shake: We use a total batch size of 128 for ResNet experiments with a labeled batch size of 31 in all label settings. For the 50k+500k and 50k+237k settings, we also limit the number of unlabeled images used in each epoch to 100k images. We use the maximum learning rate η0 = 0.1. This applies for both the short and long schedules.
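As a reproduction aid, the cyclical cosine learning rate and the consistency-weight ramp-up described in this setup can be sketched as below. This is one plausible reading of the quoted schedule, not the authors' exact code: the exact annealing formula, the replay of the last c epochs, and the linear ramp shape are assumptions, while the constants (η0 = 0.1, ℓ = 180, ℓ0 = 210, c = 30, λ ramped to 100 over 5 epochs) come from the quote above.

```python
import math

def fast_swa_lr(epoch, eta0=0.1, ell=180, ell0=210, c=30):
    """Cyclical cosine learning rate: cosine annealing with half-period ell0
    for the first ell - c epochs, then the last c epochs of that annealing
    curve replayed cyclically (an assumed reading of the quoted schedule)."""
    if epoch <= ell - c:
        t = epoch
    else:
        t = (ell - c) + (epoch - (ell - c)) % c
    return eta0 * 0.5 * (1.0 + math.cos(math.pi * t / ell0))

def consistency_weight(epoch, lam_max=100.0, rampup_epochs=5):
    """Ramp the consistency cost lambda from 0 to lam_max over the first
    rampup_epochs epochs (a linear ramp is assumed here; Tarvainen and
    Valpola use a sigmoid-shaped ramp-up)."""
    return lam_max * min(1.0, epoch / float(rampup_epochs))

# Short-schedule example: learning rates for 300 epochs, plus the epochs at
# which fast-SWA would collect weights (every k = 3 epochs once cycling begins).
lrs = [fast_swa_lr(e) for e in range(300)]
average_epochs = list(range(180, 300, 3))
```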