Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Non-splitting Neyman-Pearson Classifiers

Authors: Jingming Wang, Lucy Xia, Zhigang Bao, Xin Tong

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments have confirmed the advantages of our new non-splitting parametric strategy. The paper includes Numerical Analysis, Simulation Studies, and Real Data Analysis sections.
Researcher Affiliation | Academia | Jingming Wang (EMAIL), Department of Statistics, University of Virginia; Lucy Xia (EMAIL), Department of ISOM, School of Business and Management, Hong Kong University of Science and Technology; Zhigang Bao (EMAIL), Department of Mathematics, The University of Hong Kong; Xin Tong (EMAIL), Department of Data Sciences and Operations, Marshall Business School, University of Southern California
Pseudocode | No | The paper describes its methods in text and mathematical formulas; no explicit pseudocode or algorithm blocks are provided.
Open Source Code | No | The paper discusses licensing for the publication itself but provides no statement or link regarding open-sourcing of the code for the methodology described.
Open Datasets | Yes | Fashion-MNIST is a widely used imaging dataset for benchmarking machine learning algorithms; it contains 60,000 training images and 10,000 test images from ten fashion categories... The first dataset is a lung cancer dataset (Gordon et al., 2002; Jin and Wang, 2016) consisting of gene expression measurements from 181 tissue samples. The second dataset was originally studied in Su et al. (2001) and contains microarray data from 11 different tumor types... We consider the popular network intrusion classification problem and apply the NP classifiers to the CSE-CIC-IDS2018 dataset (Sharafaldin et al., 2018).
Dataset Splits | Yes | In each replication, we randomly split the full dataset (class 0 and class 1 separately) into a training set (composed of 70% of the data) and a test set (composed of 30% of the data).
Hardware Specification | No | The paper does not provide details about the hardware used to run its experiments.
Software Dependencies | No | The paper mentions implementing the NP umbrella algorithms using 'the R package npc with default parameters,' but it does not specify version numbers for R or the npc package.
Experiment Setup | Yes | For all five splitting NP classifiers, τ, the class 0 split proportion, is fixed at 0.5, and each experiment is repeated 1,000 times. We set the type I error upper bound α = 0.05 and the type I error violation rate target δ = 0.1. ... For NP-svm, npc adopted the radial kernel for analysis. ... In the first scenario, we randomly selected 10% of the dataset as training data... In the second scenario, we randomly selected 5% of the dataset as training data...
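The Dataset Splits and Experiment Setup entries describe the replication scheme (per-class 70%/30% train/test split) and the (α = 0.05, δ = 0.1) constraint under which the splitting NP umbrella baselines operate. As a minimal sketch of that baseline machinery only (the paper's contribution is precisely a non-splitting alternative, which this does not implement; function names are my own), the umbrella algorithm picks the smallest order statistic of the held-out class-0 scores whose binomial violation-probability bound falls below δ:

```python
import random
from math import comb

def split_per_class(X0, X1, train_frac=0.7, seed=0):
    """Randomly split class 0 and class 1 samples separately into
    train/test sets, mirroring the paper's 70%/30% replication scheme."""
    rng = random.Random(seed)

    def split(xs):
        xs = list(xs)
        rng.shuffle(xs)
        cut = int(round(train_frac * len(xs)))
        return xs[:cut], xs[cut:]

    (tr0, te0), (tr1, te1) = split(X0), split(X1)
    return tr0, te0, tr1, te1

def np_umbrella_order(n, alpha=0.05, delta=0.1):
    """Smallest 1-based index k such that thresholding at the k-th order
    statistic of n held-out class-0 scores keeps P(type I error > alpha)
    at most delta; returns None if n is too small for any k to qualify."""
    for k in range(1, n + 1):
        # Binomial tail bound on the probability that the type I error
        # of the k-th order-statistic threshold exceeds alpha.
        viol = sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j)
                   for j in range(k, n + 1))
        if viol <= delta:
            return k
    return None
```

With α = 0.05 and δ = 0.1, no order statistic qualifies until n ≥ 45 (since 0.95^45 ≈ 0.0995 ≤ 0.1 but 0.95^44 > 0.1), which is one reason splitting off class-0 data is costly and motivates the non-splitting strategy studied in the paper.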