Semi-Supervised Learning with Normalizing Flows
Authors: Pavel Izmailov, Polina Kirichenko, Marc Finzi, Andrew Gordon Wilson
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | From the abstract: We show promising results on a wide range of applications, including AG-News and Yahoo Answers text data, tabular data, and semi-supervised image classification. We also show that Flow GMM can discover interpretable structure, provide real-time optimization-free feature visualizations, and specify well calibrated predictive distributions. From Section 5 (Experiments): We evaluate Flow GMM on a wide range of datasets across different application domains including low-dimensional synthetic data (Section 5.1), text and tabular data (Section 5.2), and image data (Sections 5.3, 5.4). A minimal sketch of the FlowGMM classification rule appears after this table. |
| Researcher Affiliation | Academia | Pavel Izmailov*¹, Polina Kirichenko*¹, Marc Finzi*¹, Andrew Gordon Wilson¹ (¹New York University). Correspondence to: Pavel Izmailov <pi390@nyu.edu>. |
| Pseudocode | No | The paper describes the Expectation-Maximization algorithm in Appendix A but does not present it as formal pseudocode or an algorithm block. |
| Open Source Code | Yes | We also provide code at https://github.com/izmailovpavel/flowgmm. |
| Open Datasets | Yes | We evaluate Flow GMM on a wide range of datasets across different application domains including low-dimensional synthetic data (Section 5.1), text and tabular data (Section 5.2), and image data (Sections 5.3, 5.4). Along with standard tabular UCI datasets, we also consider text classification on AG-News and Yahoo Answers datasets. We evaluate Flow GMM in transfer learning setting on CIFAR-10 semi-supervised image classification. We next evaluate the proposed method on semi-supervised image classification benchmarks on CIFAR-10, MNIST and SVHN datasets. |
| Dataset Splits | Yes | For each of the datasets, a separate validation set of size 5k was used to tune hyperparameters. We test Flow GMM calibration on MNIST and CIFAR datasets in the supervised setting. On MNIST we restricted the training set size to 1000 objects, since on the full dataset the model makes too few mistakes which makes evaluating calibration harder. In Table 5, we report negative log likelihood and expected calibration error (ECE, see Guo et al. (2017) for a description of this metric). We can see that re-calibrating variances of the Gaussians in the mixture significantly improves both metrics and mitigates overconfidence. A generic sketch of the ECE computation follows this table. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory specifications) used for running its experiments. It discusses model architectures and training setups, but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using specific models/algorithms like "Real NVP normalizing flow architecture", "ADAM optimizer", and "BERT transformer model", but does not provide version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used in the implementation. |
| Experiment Setup | Yes | Throughout training, Gaussian mixture parameters are fixed: the means are initialized randomly from the standard normal distribution and the covariances are set to I. In all experiments, we use the Real NVP normalizing flow architecture. We use the Real NVP architecture with 5 coupling layers, defined by fully-connected shift and scale networks, each with 1 hidden layer of size 512. The tuned learning rates for each of the models that we used for these experiments are shown in Table 6. We train our Flow GMM model with a Real NVP normalizing flow, similar to the architectures used in Papamakarios et al. (2017). Specifically, the model uses 7 coupling layers, with 1 hidden layer each and 256 hidden units for the UCI datasets but 1024 for text classification. UCI models were trained for 50 epochs of unlabeled data and the text datasets were trained for 200 epochs of unlabeled data. We use Adam optimizer (Kingma & Ba, 2014) with learning rate 10⁻³ for CIFAR-10 and SVHN and 10⁻⁴ for MNIST. We train the supervised model for 100 epochs, and semi-supervised models for 1000 passes through the labeled data for CIFAR-10 and SVHN and 3000 passes for MNIST. We use a batch size of 64 and sample 32 labeled and 32 unlabeled data points in each mini-batch. For the consistency loss term (7), we linearly increase the weight from 0 to 1 for the first 100 epochs following Athiwaratkun et al. (2019). For Flow GMM and Flow GMM-cons, we re-weight the loss on labeled data by λ = 3 (value tuned on validation in Kingma et al. (2014) on CIFAR-10), as otherwise, we observed that the method underfits the labeled data. Hedged sketches of a matching coupling layer and of a FlowGMM-cons-style training objective follow this table. |
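
The quoted evidence describes FlowGMM as a Gaussian mixture in the latent space of a normalizing flow, with randomly initialized means and identity covariances (see the Experiment Setup row). The sketch below is a minimal, hedged illustration of the resulting class-conditional likelihood and Bayes classification rule; it assumes a generic `flow` callable returning the latent code and the log-determinant of its Jacobian, and is not the authors' implementation (see the linked repository for that).

```python
import math
import torch

def class_log_likelihood(flow, means, x):
    """log p(x | y=k) for every class k under a latent Gaussian mixture.

    Assumes `flow` is an invertible map returning (z, log|det J|); the
    class-conditional densities are N(mu_k, I) in latent space, matching
    the fixed means and identity covariances described in the paper.
    """
    z, log_det = flow(x)                              # z: (B, D), log_det: (B,)
    diff = z.unsqueeze(1) - means.unsqueeze(0)        # (B, K, D)
    log_gauss = -0.5 * (diff ** 2).sum(dim=-1) \
                - 0.5 * means.shape[1] * math.log(2 * math.pi)
    return log_gauss + log_det.unsqueeze(1)           # change of variables

def predict(flow, means, x):
    """Bayes classifier with uniform class priors: argmax_k p(x | y=k)."""
    return class_log_likelihood(flow, means, x).argmax(dim=1)
```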
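
The Experiment Setup row describes Real NVP coupling layers with fully-connected shift and scale networks (one hidden layer of 512 units in the synthetic experiments). A coupling layer of that shape is sketched below; the masking scheme, nonlinearity, and scale bounding are common Real NVP choices assumed here rather than details quoted from the paper.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One Real NVP affine coupling layer with a fully-connected
    shift-and-scale network (one hidden layer, 512 units by default)."""

    def __init__(self, dim, hidden=512):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)           # scale and shift for x2
        s = torch.tanh(s)                             # bound scales (assumed) for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)                        # log|det J| of the affine map
        return torch.cat([x1, y2], dim=1), log_det
```

Stacking five such layers, swapping which half is transformed between layers and summing the per-layer log-determinants, gives a `flow` callable compatible with the previous sketch.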
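
The same row reports mini-batches of 32 labeled and 32 unlabeled examples, a labeled-loss weight λ = 3, and a consistency weight ramped linearly from 0 to 1 over the first 100 epochs. The sketch below assembles a FlowGMM-cons-style objective from those quoted ingredients, reusing `class_log_likelihood` from the first sketch. The exact form of the paper's consistency term (its Eq. (7)) is not quoted here, so the squared difference between class posteriors on two augmented views is an assumption standing in for it.

```python
import math
import torch
import torch.nn.functional as F

def flowgmm_cons_loss(flow, means, x_lab, y_lab, x_unlab, x_unlab_aug,
                      lam=3.0, cons_weight=1.0):
    """Hedged sketch of a FlowGMM-cons-style objective (not the reference code).

    Labeled term: -log p(x, y), reweighted by lam (= 3 in the paper).
    Unlabeled term: -log p(x) = -logsumexp_k log p(x, y=k).
    Consistency term: squared difference between class posteriors on two
    augmented views of the unlabeled batch (assumed form of the paper's Eq. (7)).
    Reuses class_log_likelihood from the first sketch above.
    """
    log_prior = -math.log(means.shape[0])                 # uniform p(y)

    ll_lab = class_log_likelihood(flow, means, x_lab)     # (B, K)
    labeled_nll = -(ll_lab.gather(1, y_lab[:, None]).squeeze(1) + log_prior).mean()

    ll_unlab = class_log_likelihood(flow, means, x_unlab)
    unlabeled_nll = -torch.logsumexp(ll_unlab + log_prior, dim=1).mean()

    p1 = F.softmax(ll_unlab, dim=1)                       # posteriors under uniform prior
    p2 = F.softmax(class_log_likelihood(flow, means, x_unlab_aug), dim=1)
    consistency = ((p1 - p2) ** 2).sum(dim=1).mean()

    return lam * labeled_nll + unlabeled_nll + cons_weight * consistency
```

A training loop would pair this with `torch.optim.Adam(flow.parameters(), lr=1e-3)` and set `cons_weight = min(epoch / 100.0, 1.0)` to mimic the linear ramp described above.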
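
The calibration evidence under Dataset Splits cites expected calibration error (Guo et al., 2017) without defining it. The sketch below is a generic ECE computation for reference; the number of bins is an assumed default, since the paper's binning is not quoted.

```python
import torch

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE (Guo et al., 2017): bin predictions by confidence and average
    |accuracy - confidence| weighted by the fraction of samples per bin.
    n_bins=15 is an assumed default, not taken from the paper.
    """
    bin_edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = (predictions[in_bin] == labels[in_bin]).float().mean()
            confidence = confidences[in_bin].mean()
            ece += in_bin.float().mean() * (accuracy - confidence).abs()
    return ece
```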