Can semi-supervised learning use all the data effectively? A lower bound perspective

Authors: Alexandru Tifrea, Gizem Yüce, Amartya Sanyal, Fanny Yang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Nevertheless, in our real-world experiments, SSL algorithms can often outperform UL and SL algorithms. In summary, our work suggests that while it is possible to prove the performance gains of SSL algorithms, this would require careful tracking of constants in the theoretical analysis.
Researcher Affiliation | Academia | Alexandru Țifrea (ETH Zurich, alexandru.tifrea@inf.ethz.ch); Gizem Yüce (EPFL, gizem.yuce@epfl.ch); Amartya Sanyal (Max Planck Institute for Intelligent Systems, Tübingen, amsa@di.ku.dk); Fanny Yang (ETH Zurich, fan.yan@inf.ethz.ch)
Pseudocode | Yes | Algorithm 1: UL+ algorithms A_UL+; Algorithm 2: SSL-S algorithm; Algorithm 3: SSL-W algorithm
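The SSL-S algorithm above relies on pseudolabels whose confidence threshold is tuned on a validation set (see the Experiment Setup row). As a point of reference, a minimal one-round sketch of generic threshold-based self-training is shown below; this is not the paper's exact Algorithm 2, and the function name and the choice of logistic regression as the base classifier are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.9):
    """One round of threshold-based self-training: fit on the labeled
    set, pseudolabel the unlabeled points the model is confident about,
    then refit on the union of real and pseudolabeled data."""
    clf = LogisticRegression().fit(X_l, y_l)
    proba = clf.predict_proba(X_u)
    # Keep only pseudolabels whose predicted class probability clears the threshold.
    confident = proba.max(axis=1) >= threshold
    X_aug = np.vstack([X_l, X_u[confident]])
    y_aug = np.concatenate([y_l, proba[confident].argmax(axis=1)])
    return LogisticRegression().fit(X_aug, y_aug)
```

In the paper's protocol, the threshold itself would be selected on the held-out validation set rather than fixed.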
Open Source Code | No | The paper uses existing libraries like Scikit-Learn for implementations ('We use logistic regression from Scikit-Learn [28]', 'We use an implementation of Expectation-Maximization from the Scikit-Learn library') but does not state that its own specific code for the described methodology is publicly available.
Open Datasets | Yes | We consider 10 binary classification real-world datasets: five from the OpenML repository [37] and five 2-class subsets of the MNIST dataset [13].
Dataset Splits | Yes | We split each dataset into a test set, a validation set, and a training set. The (unlabeled) validation set and the test set have 1000 labeled samples each.
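The split described in this row (1000 samples each for validation and test, remainder as the training pool) can be sketched as follows; `split_dataset` and the single-shuffle strategy are illustrative assumptions, not the authors' code:

```python
import numpy as np

def split_dataset(X, y, n_test=1000, n_val=1000, seed=0):
    """Shuffle once, then carve off the test and validation sets
    (1000 samples each, as in the paper); the rest is the training pool."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))
```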
Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or cloud computing resources used for running the experiments.
Software Dependencies | No | The paper mentions using 'Scikit-Learn [28]' for its algorithms (e.g., 'We use logistic regression from Scikit-Learn [28]', 'Expectation-Maximization from the Scikit-Learn library') but does not provide specific version numbers for these software components, which is required for reproducibility.
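Since the paper omits version numbers, one lightweight remedy is to log installed dependency versions at run time. A stdlib-only sketch (the helper name `log_versions` is hypothetical):

```python
from importlib import metadata

def log_versions(packages):
    """Record the installed version of each named distribution,
    so a run's software environment can be reported alongside results."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return versions
```

Calling `log_versions(["scikit-learn", "numpy"])` at the start of an experiment and saving the result with the outputs would close this reproducibility gap.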
Experiment Setup | Yes | The labeled and unlabeled set sizes are set to 20 and 2000, respectively. The unlabeled set size is fixed to 5000 for the synthetic experiments and 4000 for the real-world datasets. The size of the labeled set nl is varied in each experiment. For each dataset, we draw a different labeled subset 20 times and report the average and the standard deviation of the error gap (or the error) over these runs. The (unlabeled) validation set and the test set have 1000 labeled samples each. We use the validation set to select the ridge penalty for SL... The best confidence threshold for the pseudolabels is selected using the validation set. Moreover, the optimal weight for SSL-W is also chosen with the help of the validation set.
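The SL protocol in this row (repeated draws of a small labeled subset, validation-based selection of the ridge penalty, mean and standard deviation of the test error over 20 runs) can be sketched as follows. The function name, the candidate grid `Cs`, and the use of scikit-learn's `LogisticRegression` (whose `C` is the inverse L2 penalty strength) are illustrative assumptions, not the authors' code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def run_sl_baseline(X_pool, y_pool, X_val, y_val, X_test, y_test,
                    n_l=20, n_repeats=20, Cs=(0.01, 0.1, 1.0, 10.0), seed=0):
    """Draw a labeled subset of size n_l, select the ridge (L2) penalty
    on the validation set, and average the test error over repeats."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X_pool), size=n_l, replace=False)
        X_l, y_l = X_pool[idx], y_pool[idx]
        best_clf, best_val_err = None, np.inf
        for C in Cs:  # C is the inverse of the regularization strength
            clf = LogisticRegression(C=C).fit(X_l, y_l)
            val_err = 1.0 - clf.score(X_val, y_val)
            if val_err < best_val_err:
                best_clf, best_val_err = clf, val_err
        errors.append(1.0 - best_clf.score(X_test, y_test))
    return np.mean(errors), np.std(errors)
```

The same outer loop (resample, tune on validation, evaluate on test) would wrap the SSL-S confidence threshold and the SSL-W weight described above.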