Unbiased Active Semi-supervised Binary Classification Models

Authors: JooChul Lee, Weidong Ma, Ziyang Wang

IJCAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental The superiority of our method is demonstrated on synthetic and real data examples. In this section, we conduct numerical studies to assess the performance of the proposed estimator with synthetic data and four real data examples.
Researcher Affiliation Academia JooChul Lee (1), Weidong Ma (2), and Ziyang Wang (3); (1) Department of Mathematics and Statistics, Auburn University, USA; (2) Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, USA; (3) Department of Statistics, University of Connecticut, USA
Pseudocode Yes Algorithm 1 Unbiased Active Semi-supervised Learning Algorithm
Open Source Code Yes The code used for the numerical studies is available in a GitHub repository: https://github.com/IJCAI-24/Active Semi Prediction.
Open Datasets Yes We apply the proposed algorithm to four real datasets: 1) Bank Marketing data, 2) SUSY data, 3) Credit Card Clients data, and 4) Purchasing Intention data. The datasets are available on the UCI Machine Learning Repository: 1) https://archive.ics.uci.edu/ml/datasets/bank+marketing, 2) https://archive.ics.uci.edu/ml/datasets/SUSY, 3) https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients, and 4) https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.
Dataset Splits No The paper describes using "full training data of size N = 10^5" and selecting "subdata of size 100" in "10 batches," starting from an initial "uniform sample of size 150." It does not explicitly describe a separate validation dataset or specific train/validation/test splits beyond the batch-wise active sampling process.
Hardware Specification No The paper does not provide specific hardware details such as CPU/GPU models, memory specifications, or cloud computing instances used for running its experiments.
Software Dependencies No The paper mentions using "Natural spline models" but does not specify any software libraries or dependencies with their version numbers that would be needed to replicate the experiment.
Experiment Setup Yes We generate full training data of size N = 10^5 and consider 10 batches. In each batch, we select a subdata of size 100. For the initial values in the proposed algorithm, uniform samples of size 150 are used. The repetition is 300 times. A natural spline model with 2 degrees of freedom is used for the imputation model in each repetition. For the initial sample size and the subdata size, we consider 150 and 100 for the first two examples, and 200 and 200 for the other examples, respectively. The total number of batches is 10 and the repetition is 300 times. We build a natural spline model with 2 or 3 degrees of freedom for the imputation models.
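The reported setup can be sketched as a batch-wise active labeling loop. This is a minimal illustration, not the paper's Algorithm 1: the pool data, the logistic working model, and the uncertainty-based acquisition rule below are all placeholder assumptions; only the sizes (pool N = 10^5, initial uniform sample of 150, 10 batches of 100) follow the reported configuration.

```python
# Hypothetical sketch of the batch-wise active sampling setup.
# Sizes follow the paper's reported setup; the data-generating model,
# the logistic working model, and the "most uncertain first" selection
# rule are placeholder assumptions, not the paper's Algorithm 1.
import math
import random

random.seed(0)

N = 100_000        # full training pool (N = 10^5)
INIT = 150         # initial uniform labeled sample
BATCHES = 10       # number of sampling batches
BATCH_SIZE = 100   # labels acquired per batch

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy pool: one covariate x, binary label y ~ Bernoulli(sigmoid(2x - 1))
pool_x = [random.gauss(0.0, 1.0) for _ in range(N)]
pool_y = [1 if random.random() < sigmoid(2 * x - 1) else 0 for x in pool_x]

labeled = set(random.sample(range(N), INIT))

def fit_logistic(idx, steps=200, lr=0.1):
    """Plain gradient-descent logistic fit on the labeled subset."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw = gb = 0.0
        for i in idx:
            p = sigmoid(w * pool_x[i] + b)
            gw += (p - pool_y[i]) * pool_x[i]
            gb += (p - pool_y[i])
        w -= lr * gw / len(idx)
        b -= lr * gb / len(idx)
    return w, b

for _ in range(BATCHES):
    w, b = fit_logistic(labeled)
    # Placeholder acquisition: label the unlabeled points whose predicted
    # probability is closest to 0.5 (the paper's sampling scheme differs).
    unlabeled = [i for i in range(N) if i not in labeled]
    unlabeled.sort(key=lambda i: abs(sigmoid(w * pool_x[i] + b) - 0.5))
    labeled.update(unlabeled[:BATCH_SIZE])

# Total label budget: 150 initial + 10 batches of 100 = 1150
assert len(labeled) == INIT + BATCHES * BATCH_SIZE
```

After the loop, only 1,150 of the 100,000 pool points carry labels, matching the label budget implied by the reported configuration; the remaining points would feed the semi-supervised imputation step described in the paper.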