Positive unlabeled learning via wrapper-based adaptive sampling
Authors: Pengyi Yang, Wei Liu, Jean Yang
IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies suggest that AdaSampling requires very few iterations to accurately distinguish unlabeled positive and negative instances even with a very high positive-to-negative instance ratio in the unlabeled data. We next compared AdaSampling-based single and ensemble models with the state-of-the-art bias-based approach and bootstrap sampling approach using Support Vector Machine (SVM) and k-Nearest Neighbours (kNN) and a panel of evaluation metrics on several real-world datasets with different ratios of unlabeled positive instances. Our experimental results demonstrate that AdaSampling significantly improves classification for both SVM and kNN, and their performance compares favourably to state-of-the-art methods. |
| Researcher Affiliation | Academia | 1Charles Perkins Centre, School of Mathematics and Statistics, University of Sydney, Australia 2Advanced Analytics Institute, University of Technology Sydney, Australia |
| Pseudocode | Yes | Algorithm 1: AdaSampling for a single model; Algorithm 2: AdaSampling for an ensemble of models (a minimal, illustrative sketch of the single-model loop is given after the table). |
| Open Source Code | Yes | All the data and code are available from the project repository: https://github.com/PengyiYang/AdaSampling |
| Open Datasets | Yes | All these datasets were obtained from UC Irvine Machine Learning Repository [Lichman, 2013] |
| Dataset Splits | Yes | We used a multi-layered repetitive 5-fold cross-validation (CV) procedure to evaluate the performance of each method. Specifically, label information of instances from the positive class was randomly removed. This is repeated 5 times, each with a different set of selected instances, and comprises the first layer of randomisation. Subsequently, the data is split for 5-fold CV and this is repeated 10 times, each with a different split point (see the evaluation sketch after the table). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., CPU or GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using Support Vector Machine (SVM) and k-nearest neighbour (kNN) classification algorithms and specifies some parameters, such as 'an SVM with radial basis function kernel (C=1) and a kNN with k=3'. However, it does not provide specific version numbers for any software packages or libraries used (e.g., Python version, scikit-learn version, or specific SVM library version). |
| Experiment Setup | Yes | An SVM with radial basis function kernel (C=1) and a kNN with k=3 were used across all positive unlabeled methods as well as the baseline... We set ε to be 0.01, requiring less than a 1% change in the mean prediction probabilities of all instances for the process to terminate. |
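
As a reading aid, here is a minimal, illustrative Python sketch of the single-model AdaSampling loop described in the Pseudocode and Experiment Setup rows: unlabeled instances are resampled as pseudo-negatives in proportion to their current predicted probability of being negative, and the loop stops once the mean prediction changes by less than ε = 0.01. The function name `ada_sampling`, the scikit-learn `SVC` base learner, and the sampling details are assumptions for illustration, not the authors' implementation (which is available from the repository linked above).

```python
# Illustrative sketch of the AdaSampling wrapper loop (single-model case).
# Names (ada_sampling, X_pos, X_unl, eps) are hypothetical; the base
# learner shown is scikit-learn's SVC as an example of the SVM (C=1, RBF
# kernel) used in the paper's experiments.
import numpy as np
from sklearn.svm import SVC

def ada_sampling(X_pos, X_unl, eps=0.01, max_iter=50, random_state=0):
    """Iteratively re-sample pseudo-negatives from the unlabeled set."""
    rng = np.random.default_rng(random_state)
    n_unl = X_unl.shape[0]
    # Start by treating every unlabeled instance as equally likely negative.
    p_neg = np.full(n_unl, 0.5)
    prev_mean = p_neg.mean()
    model = None
    for _ in range(max_iter):
        # Sample pseudo-negatives with probability proportional to p_neg.
        keep = rng.random(n_unl) < p_neg
        if not keep.any():
            keep[rng.integers(n_unl)] = True
        X_train = np.vstack([X_pos, X_unl[keep]])
        y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(keep.sum())])
        model = SVC(kernel="rbf", C=1.0, probability=True).fit(X_train, y_train)
        # Update each unlabeled instance's probability of being negative.
        p_neg = model.predict_proba(X_unl)[:, list(model.classes_).index(0.0)]
        # Stop when the mean prediction changes by less than eps (1%).
        if abs(p_neg.mean() - prev_mean) < eps:
            break
        prev_mean = p_neg.mean()
    return model, p_neg
```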
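
Similarly, a hedged sketch of the multi-layered repetitive 5-fold CV protocol quoted in the Dataset Splits row: an outer layer that repeatedly hides a random subset of positive labels, and an inner layer of repeated 5-fold CV. The helper names (`hide_positive_labels`, `evaluate_fold`), the default hidden-positive ratio, and the use of scikit-learn's `StratifiedKFold` are illustrative assumptions, not taken from the paper or its code.

```python
# Illustrative sketch of the multi-layered evaluation protocol:
# layer 1 repeatedly relabels a fraction of positives as unlabeled,
# layer 2 runs repeated 5-fold CV on each resulting PU dataset.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def hide_positive_labels(y, ratio, rng):
    """Relabel a random fraction of positives as unlabeled (encoded 0)."""
    y_pu = y.copy()
    pos_idx = np.flatnonzero(y == 1)
    hidden = rng.choice(pos_idx, size=int(ratio * len(pos_idx)), replace=False)
    y_pu[hidden] = 0
    return y_pu

def multilayer_cv(X, y, evaluate_fold, ratio=0.4,
                  n_label_repeats=5, n_cv_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_label_repeats):          # layer 1: random label removal
        y_pu = hide_positive_labels(y, ratio, rng)
        for rep in range(n_cv_repeats):       # layer 2: repeated 5-fold CV
            skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
            for train_idx, test_idx in skf.split(X, y):
                # evaluate_fold is a user-supplied callback that trains a PU
                # method on the (partially unlabeled) training fold and scores
                # it against the true labels of the test fold.
                scores.append(evaluate_fold(X[train_idx], y_pu[train_idx],
                                            X[test_idx], y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```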