Unbiased Active Semi-supervised Binary Classification Models

Authors: JooChul Lee, Weidong Ma, Ziyang Wang

IJCAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental The superiority of our method is demonstrated on synthetic and real data examples. In this section, we conduct numerical studies to assess the performance of the proposed estimator with synthetic data and four real data examples.
Researcher Affiliation Academia JooChul Lee (1), Weidong Ma (2), and Ziyang Wang (3); (1) Department of Mathematics and Statistics, Auburn University, USA; (2) Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, USA; (3) Department of Statistics, University of Connecticut, USA
Pseudocode Yes Algorithm 1 Unbiased Active Semi-supervised Learning Algorithm
Open Source Code Yes The code used for the numerical studies is available in a GitHub repository: https://github.com/IJCAI-24/Active Semi Prediction.
Open Datasets Yes We apply the proposed algorithm to four real datasets: 1) Bank Marketing data, 2) SUSY data, 3) Credit Card Clients data, and 4) Purchasing Intention data. The datasets are available on the UCI Machine Learning Repository: 1) https://archive.ics.uci.edu/ml/datasets/bank+marketing, 2) https://archive.ics.uci.edu/ml/datasets/SUSY, 3) https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients, and 4) https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.
Dataset Splits No The paper describes using "full training data of size N = 10^5" and selecting "subdata of size 100" in "10 batches," starting from an initial "uniform sample of size 150." It does not explicitly describe a separate validation dataset or specific train/validation/test splits beyond the batch-wise active sampling process.
Hardware Specification No The paper does not provide specific hardware details such as CPU/GPU models, memory specifications, or cloud computing instances used for running its experiments.
Software Dependencies No The paper mentions using "Natural spline models" but does not specify any software libraries or dependencies with their version numbers that would be needed to replicate the experiment.
Experiment Setup Yes We generate full training data of size N = 10^5 and consider 10 batches. In each batch, we select a subdata of size 100. For the initial values in the proposed algorithm, uniform samples of size 150 are used. The repetition is 300 times. A natural spline model with 2 degrees of freedom is used for the imputation model in each repetition. For the initial sample size and the subdata size, we consider 150 and 100 for the first two examples, and 200 and 200 for the other examples, respectively. The total number of batches is 10 and the repetition is 300 times. We build a natural spline model with 2 or 3 degrees of freedom for the imputation models.
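The reported setup can be sketched as a batch-wise active labeling loop. This is a minimal illustration, not the paper's Algorithm 1: the pool data, the logistic working model, and the uncertainty-based acquisition rule below are all placeholder assumptions; only the sizes (pool N = 10^5, initial uniform sample of 150, 10 batches of 100) follow the reported configuration.

```python
# Hypothetical sketch of the batch-wise active sampling setup.
# Sizes follow the paper's reported setup; the data-generating model,
# the logistic working model, and the "most uncertain first" selection
# rule are placeholder assumptions, not the paper's Algorithm 1.
import math
import random

random.seed(0)

N = 100_000        # full training pool (N = 10^5)
INIT = 150         # initial uniform labeled sample
BATCHES = 10       # number of sampling batches
BATCH_SIZE = 100   # labels acquired per batch

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy pool: one covariate x, binary label y ~ Bernoulli(sigmoid(2x - 1))
pool_x = [random.gauss(0.0, 1.0) for _ in range(N)]
pool_y = [1 if random.random() < sigmoid(2 * x - 1) else 0 for x in pool_x]

labeled = set(random.sample(range(N), INIT))

def fit_logistic(idx, steps=200, lr=0.1):
    """Plain gradient-descent logistic fit on the labeled subset."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for i in idx:
            p = sigmoid(w * pool_x[i] + b)
            gw += (p - pool_y[i]) * pool_x[i]
            gb += (p - pool_y[i])
        w -= lr * gw / len(idx)
        b -= lr * gb / len(idx)
    return w, b

for _ in range(BATCHES):
    w, b = fit_logistic(labeled)
    # Placeholder acquisition: label the unlabeled points whose predicted
    # probability is closest to 0.5 (the paper's sampling scheme differs).
    unlabeled = [i for i in range(N) if i not in labeled]
    unlabeled.sort(key=lambda i: abs(sigmoid(w * pool_x[i] + b) - 0.5))
    labeled.update(unlabeled[:BATCH_SIZE])

# Total label budget: 150 initial + 10 batches of 100 = 1150
assert len(labeled) == INIT + BATCHES * BATCH_SIZE
```

After the loop, only 1,150 of the 100,000 pool points carry labels, matching the label budget implied by the reported configuration; the remaining points would feed the semi-supervised imputation step described in the paper.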