Can semi-supervised learning use all the data effectively? A lower bound perspective
Authors: Alexandru Tifrea, Gizem Yüce, Amartya Sanyal, Fanny Yang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Nevertheless, in our real-world experiments, SSL algorithms can often outperform UL and SL algorithms. In summary, our work suggests that while it is possible to prove the performance gains of SSL algorithms, this would require careful tracking of constants in the theoretical analysis. |
| Researcher Affiliation | Academia | Alexandru Tifrea, ETH Zurich, alexandru.tifrea@inf.ethz.ch; Gizem Yüce, EPFL, gizem.yuce@epfl.ch; Amartya Sanyal, Max Planck Institute for Intelligent Systems, Tübingen, amsa@di.ku.dk; Fanny Yang, ETH Zurich, fan.yang@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1: UL+ algorithms A_UL+; Algorithm 2: SSL-S algorithm; Algorithm 3: SSL-W algorithm (hedged sketches of UL+- and SSL-style procedures appear after the table) |
| Open Source Code | No | The paper uses existing libraries such as Scikit-Learn for its implementations ('We use logistic regression from Scikit-Learn [28]', 'We use an implementation of Expectation-Maximization from the Scikit-Learn library'), but it does not state that its own code for the described methodology is publicly available (a hedged Expectation-Maximization sketch using Scikit-Learn appears after the table). |
| Open Datasets | Yes | We consider 10 binary classification real-world datasets: five from the OpenML repository [37] and five 2-class subsets of the MNIST dataset [13]. |
| Dataset Splits | Yes | We split each dataset into a test set, a validation set, and a training set. The (unlabeled) validation set and the test set have 1000 labeled samples each. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or cloud computing resources used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Scikit-Learn [28]' for its algorithms (e.g., 'We use logistic regression from Scikit-Learn [28]', 'Expectation-Maximization from the Scikit-Learn library') but does not provide specific version numbers for these software components, which is required for reproducibility. |
| Experiment Setup | Yes | In one set of experiments, the labeled and unlabeled set sizes are set to 20 and 2000, respectively; in the others, the unlabeled set size is fixed to 5000 (synthetic) and 4000 (real-world datasets) while the labeled set size n_l is varied. For each dataset, a different labeled subset is drawn 20 times and the average and standard deviation of the error gap (or the error) over these runs are reported. The (unlabeled) validation set and the test set have 1000 labeled samples each. The validation set is used to select the ridge penalty for SL, the best confidence threshold for the pseudolabels, and the optimal weight for SSL-W. (A hedged sketch of such an evaluation loop appears after the table.) |
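
The paper's Algorithm 1 (UL+) is only described in pseudocode, but since the evidence above mentions an Expectation-Maximization implementation from Scikit-Learn, the following is a minimal sketch of what a UL+-style baseline could look like: EM fits a two-component Gaussian mixture to the unlabeled data, and the few labeled points are used only to match mixture components to classes. The function name, the binary 0/1 label assumption, and the majority-vote matching rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def ul_plus_sketch(X_lab, y_lab, X_unlab, seed=0):
    """Hypothetical UL+-style baseline (numpy arrays, binary 0/1 labels assumed)."""
    # Fit a 2-component Gaussian mixture with EM on the unlabeled data only.
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(X_unlab)

    # Use the small labeled set only to decide which component corresponds to which class
    # (majority vote; fall back to the component index if no labeled point lands in it).
    lab_components = gmm.predict(X_lab)
    component_to_class = {}
    for c in (0, 1):
        mask = lab_components == c
        component_to_class[c] = int(np.round(y_lab[mask].mean())) if mask.any() else c

    def predict(X):
        return np.array([component_to_class[c] for c in gmm.predict(X)])

    return predict
```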
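Algorithms 2 and 3 (SSL-S and SSL-W) are characterized in the evidence only through a confidence threshold for pseudolabels and a weight chosen on the validation set. The sketch below combines both ideas in one hypothetical self-training routine built on Scikit-Learn logistic regression; the default threshold, the down-weighting scheme, and the function name are assumptions rather than the paper's exact procedure, and both hyperparameters would be tuned on the validation set as the paper describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ssl_weighted_sketch(X_lab, y_lab, X_unlab, threshold=0.8, unlab_weight=0.5, C=1.0):
    """Hypothetical SSL sketch: supervised fit, confidence-thresholded pseudolabels,
    then a refit that down-weights the pseudolabeled points."""
    # 1) Fit a supervised model on the small labeled set.
    base = LogisticRegression(C=C).fit(X_lab, y_lab)

    # 2) Pseudolabel only the unlabeled points whose predicted confidence clears the threshold.
    proba = base.predict_proba(X_unlab)
    keep = proba.max(axis=1) >= threshold
    pseudo_y = base.classes_[proba.argmax(axis=1)]

    # 3) Refit on labeled + pseudolabeled data, with reduced weight on pseudolabels.
    X_all = np.vstack([X_lab, X_unlab[keep]])
    y_all = np.concatenate([y_lab, pseudo_y[keep]])
    w_all = np.concatenate([np.ones(len(y_lab)),
                            np.full(int(keep.sum()), unlab_weight)])

    return LogisticRegression(C=C).fit(X_all, y_all, sample_weight=w_all)
```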
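Finally, the reported experiment setup (fixed split sizes, 20 repeated draws of the labeled subset, validation-based selection of the ridge penalty, mean and standard deviation over runs) can be summarized as a loop like the one below. The ridge grid, the seeding, and the SL-only model selection are assumptions; since the authors' code is not released, this only mirrors the described protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_evaluation(X, y, n_l=20, n_unlab=4000, n_val=1000, n_test=1000,
                   ridge_grid=(0.01, 0.1, 1.0, 10.0), n_repeats=20, seed=0):
    """Hypothetical evaluation loop mirroring the described setup (numpy arrays assumed)."""
    rng = np.random.RandomState(seed)
    test_errors = []
    for rep in range(n_repeats):
        # Carve out the test and validation sets (1000 samples each in the paper).
        X_rest, X_test, y_rest, y_test = train_test_split(
            X, y, test_size=n_test, random_state=rep)
        X_rest, X_val, y_rest, y_val = train_test_split(
            X_rest, y_rest, test_size=n_val, random_state=rep)

        # Draw a fresh labeled subset of size n_l; the next n_unlab points would serve as
        # the unlabeled pool for the SSL / UL+ routines (unused by this SL-only baseline).
        idx = rng.permutation(len(y_rest))
        lab = idx[:n_l]
        unlab = idx[n_l:n_l + n_unlab]

        # Select the ridge (L2) penalty on the validation set; C is the inverse penalty.
        best_model, best_val_err = None, np.inf
        for C in ridge_grid:
            model = LogisticRegression(C=C).fit(X_rest[lab], y_rest[lab])
            val_err = np.mean(model.predict(X_val) != y_val)
            if val_err < best_val_err:
                best_model, best_val_err = model, val_err

        test_errors.append(np.mean(best_model.predict(X_test) != y_test))

    # Report the average error and its standard deviation over the repeated draws.
    return float(np.mean(test_errors)), float(np.std(test_errors))
```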