MixMatch: A Holistic Approach to Semi-Supervised Learning

Authors: David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, Colin A. Raffel

NeurIPS 2019

Reproducibility variables (each entry lists the variable, the result, and the supporting LLM response):
Research Type: Experimental. "Experimentally, we show that MixMatch obtains state-of-the-art results on all standard image benchmarks (section 4.2), reducing the error rate on CIFAR-10 by a factor of 4. We further show in an ablation study that MixMatch is greater than the sum of its parts. We demonstrate in section 4.3 that MixMatch is useful for differentially private learning, enabling students in the PATE framework [36] to obtain new state-of-the-art results that simultaneously strengthen both privacy guarantees and accuracy. In short, MixMatch introduces a unified loss term for unlabeled data that seamlessly reduces entropy while maintaining consistency and remaining compatible with traditional regularization techniques."
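
The unified loss term quoted above has a simple form: a cross-entropy term on the labeled batch plus a weighted squared-error term on the unlabeled batch. Below is a minimal NumPy sketch, not the authors' code; the function and argument names are illustrative, and the inputs are assumed to be arrays of class probabilities and targets:

```python
import numpy as np

def mixmatch_loss(probs_x, targets_x, probs_u, guesses_u, lam_u):
    """Combined MixMatch objective L = L_X + lambda_U * L_U (sketch).

    L_X: cross-entropy between predictions on the mixed labeled batch
    and its (possibly non-one-hot, MixUp-mixed) targets.
    L_U: squared L2 distance between predictions on the mixed unlabeled
    batch and its guessed labels, normalized by the number of classes.
    """
    num_classes = probs_x.shape[1]
    loss_x = -np.mean(np.sum(targets_x * np.log(probs_x + 1e-12), axis=1))
    loss_u = np.mean(np.sum((guesses_u - probs_u) ** 2, axis=1)) / num_classes
    return loss_x + lam_u * loss_u
```

The paper motivates the squared-error (Brier) term for unlabeled data by noting that, unlike cross-entropy, it is bounded and less sensitive to completely incorrect predictions, making it a more forgiving consistency loss against guessed labels.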
Researcher Affiliation: Industry. David Berthelot (Google Research, dberth@google.com); Nicholas Carlini (Google Research, ncarlini@google.com); Ian Goodfellow (work done at Google, ian-academic@mailfence.com); Avital Oliver (Google Research, avitalo@google.com); Nicolas Papernot (Google Research, papernot@google.com); Colin Raffel (Google Research, craffel@google.com).
Pseudocode: Yes. "The full MixMatch algorithm is provided in algorithm 1, and a diagram of the label guessing process is shown in fig. 1. Algorithm 1: MixMatch takes a batch of labeled data X and a batch of unlabeled data U and produces a collection X′ (resp. U′) of processed labeled examples (resp. unlabeled examples with guessed labels)."
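
The label-guessing step that Algorithm 1 and fig. 1 describe can be sketched compactly. In the snippet below, `model` is assumed to map a batch to class probabilities and `augment` to return a stochastically augmented copy of the batch; both are hypothetical stand-ins, with K = 2 and T = 0.5 matching the defaults reported in the experiment setup:

```python
import numpy as np

def guess_labels(model, u_batch, augment, K=2, T=0.5):
    """Guess labels for an unlabeled batch (sketch of one step of Algorithm 1).

    Average the model's predictions over K stochastic augmentations of
    the batch, then sharpen the averaged distribution with temperature T:
    Sharpen(p, T)_i = p_i^(1/T) / sum_j p_j^(1/T). As T -> 0 this
    approaches a one-hot distribution, which is what lowers the entropy
    of the guessed labels.
    """
    avg = np.mean([model(augment(u_batch)) for _ in range(K)], axis=0)
    sharpened = avg ** (1.0 / T)
    return sharpened / sharpened.sum(axis=1, keepdims=True)
```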
Open Source Code: Yes. "We release all code used in our experiments." (https://github.com/google-research/mixmatch)
Open Datasets: Yes. "We test the effectiveness of MixMatch on standard SSL benchmarks (section 4.2). First, we evaluate the effectiveness of MixMatch on four standard benchmark datasets: CIFAR-10 and CIFAR-100 [24], SVHN [32], and STL-10 [8]."
Dataset Splits: Yes. "Our implementation of the model and training procedure closely matches that of [35] (including using 5000 examples to select the hyperparameters), except for the following differences: First, instead of decaying the learning rate, we evaluate models using an exponential moving average of their parameters with a decay rate of 0.999."
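
The parameter averaging mentioned in that excerpt replaces learning-rate decay with an exponential moving average of the weights, which is then evaluated in place of the raw weights. A minimal sketch, assuming parameters live in plain dicts of arrays (the released TensorFlow implementation handles this differently):

```python
def ema_update(ema_params, model_params, decay=0.999):
    """One EMA step with the paper's decay rate of 0.999.

    Called after every optimizer update; evaluation then uses
    `ema_params` rather than the live model weights.
    """
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params
```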
Hardware Specification: No. The paper does not provide specific details about the hardware used for experiments, such as GPU models, CPU types, or memory specifications; it only mentions the size of the models used.
Software Dependencies: No. The paper states that 'Our implementation of the model and training procedure closely matches that of [35]', but it does not specify any software names with version numbers (e.g., Python, TensorFlow, PyTorch, or CUDA versions) required for reproduction.
Experiment Setup: Yes. "Since MixMatch combines multiple mechanisms for leveraging unlabeled data, it introduces various hyperparameters, specifically the sharpening temperature T, the number of unlabeled augmentations K, the α parameter for the Beta distribution in MixUp, and the unsupervised loss weight λU. In practice, semi-supervised learning methods with many hyperparameters can be problematic because cross-validation is difficult with small validation sets [35, 39]. However, we find in practice that most of MixMatch's hyperparameters can be fixed and do not need to be tuned on a per-experiment or per-dataset basis. Specifically, for all experiments we set T = 0.5 and K = 2. Further, we only change α and λU on a per-dataset basis; we found that α = 0.75 and λU = 100 are good starting points for tuning. In all experiments, we linearly ramp up λU to its maximum value over the first 16,000 steps of training, as is common practice [44]. ... we apply a weight decay of 0.0004 at each update for the Wide ResNet-28 model. ... For this model, we used a weight decay of 0.0008. We used λU = 75 for CIFAR-10 and λU = 150 for CIFAR-100. ... We used λU = 250. ... For SVHN+Extra we used α = 0.25, λU = 250, and a lower weight decay of 0.000002. ... We used λU = 50."
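
The one schedule in that setup, the linear ramp-up of λU over the first 16,000 steps, is easy to state precisely. A sketch using the quoted CIFAR-10 value λU = 75 as the default maximum (the function name and defaults are illustrative):

```python
def lambda_u_schedule(step, max_lambda_u=75.0, rampup_steps=16_000):
    """Linearly ramp the unsupervised loss weight from 0 to its maximum
    over the first `rampup_steps` training steps, then hold it constant."""
    return max_lambda_u * min(step / rampup_steps, 1.0)
```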