Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Positive unlabeled learning via wrapper-based adaptive sampling
Authors: Pengyi Yang, Wei Liu, Jean Yang
IJCAI 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical studies suggest that AdaSampling requires very few iterations to accurately distinguish unlabeled positive and negative instances even with very high positive to negative instance ratio in unlabeled data. We next compared AdaSampling based single and ensemble models with the state-of-the-art bias-based approach and bootstrap sampling approach using Support Vector Machine (SVM) and k Nearest Neighbours (kNN) and a panel of evaluation metrics on several real-world datasets with different ratios of unlabeled positive instances. Our experimental results demonstrate that AdaSampling significantly improve on classification for both SVM and kNN, and their performance compared favourably to state-of-the-art methods. |
| Researcher Affiliation | Academia | 1Charles Perkins Centre, School of Mathematics and Statistics, University of Sydney, Australia 2Advanced Analytics Institute, University of Technology Sydney, Australia |
| Pseudocode | Yes | Algorithm 1: AdaSampling for single model; Algorithm 2: AdaSampling for ensemble of models |
| Open Source Code | Yes | All the data and code are available from the project repository (https://github.com/PengyiYang/AdaSampling). |
| Open Datasets | Yes | All these datasets were obtained from UC Irvine Machine Learning Repository [Lichman, 2013] |
| Dataset Splits | Yes | We used a multi-layered repetitive 5-fold cross-validation (CV) procedure to evaluate the performance of each method. Specifically, label information of instances from the positive class were randomly removed. This is repeated 5 times each with a different set of selected instances and comprise the first layer of randomisation. Subsequently, the data is split for 5-fold CV and this is repeated 10 times each with a different split point. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments (e.g., CPU or GPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using Support Vector Machine (SVM) and k-nearest neighbour (kNN) classification algorithms and specifies some parameters like 'An SVM with radial basis function kernel (C=1) and a kNN with k=3'. However, it does not provide specific version numbers for any software packages or libraries used (e.g., Python version, scikit-learn version, specific SVM library version). |
| Experiment Setup | Yes | An SVM with radial basis function kernel (C=1) and a kNN with k=3 were used across all positive unlabeled methods as well as the baseline... We set ε to be 0.01, requiring smaller than 1% change in mean prediction probabilities of all instances for the process to terminate. |
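
The Dataset Splits and Experiment Setup rows describe a two-layer evaluation loop and an ε-based stopping rule. The following is a minimal Python sketch of how that structure might look, not the authors' implementation. The function names, the `fraction` parameter (the share of positive labels hidden), the use of plain unstratified shuffled folds, and the reading of the stopping rule as a mean absolute change in predicted probabilities are all assumptions made for illustration; only the counts (5 label-removal repeats, 10 CV repeats, 5 folds) and ε = 0.01 come from the quoted text.

```python
import random


def hide_positive_labels(labels, fraction, rng):
    """Relabel a random fraction of positives (1) as unlabeled (0).

    `fraction` is a hypothetical parameter; the quoted text only says that
    label information of positive instances was randomly removed.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    n_hide = int(round(fraction * len(positives)))
    hidden = set(rng.sample(positives, n_hide))
    return [0 if i in hidden else y for i, y in enumerate(labels)]


def kfold_indices(n, k, rng):
    """Yield (train, test) index lists for one shuffled k-fold split.

    Folds are not stratified here; that is an assumption of this sketch.
    """
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def multilayer_cv(labels, fraction, n_label_repeats=5, n_cv_repeats=10,
                  k=5, seed=0):
    """First layer: repeated label removal. Second layer: repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(n_label_repeats):
        pu_labels = hide_positive_labels(labels, fraction, rng)
        for c in range(n_cv_repeats):
            for train, test in kfold_indices(len(labels), k, rng):
                yield r, c, pu_labels, train, test


def has_converged(prev_probs, new_probs, eps=0.01):
    """One possible reading of the stopping rule (an assumption):
    stop once the mean absolute change in predicted probabilities
    across all instances falls below eps."""
    diffs = [abs(a - b) for a, b in zip(prev_probs, new_probs)]
    return sum(diffs) / len(diffs) < eps
```

With the defaults above, `multilayer_cv` produces 5 × 10 × 5 = 250 train/test evaluations per dataset and hidden-label setting. A classifier such as an RBF-kernel SVM with C=1 or a kNN with k=3 would be fitted on each `train` split using `pu_labels` and scored on `test` against the true labels.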