Instance Selection: A Bayesian Decision Theory Perspective
Authors: Qingqiang Chen, Fuyuan Cao, Ying Xing, Jiye Liang (pp. 6287–6294)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The performance of our method is studied on extensive synthetic and benchmark data sets. To properly examine the performance of BDIS, we employ random sampling (RS), one of the most classic and commonly used instance selection methods, RIS, and EGDIS as baseline methods. The comparisons are carried out on multiple synthetic and 12 benchmark data sets, which are available at the UCI Repository (Dua and Graff 2017). Table 2: Comparisons of classification accuracy (A) and reduction rate (R). |
| Researcher Affiliation | Academia | Qingqiang Chen, Fuyuan Cao, Ying Xing, Jiye Liang School of Computer and Information Technology, Shanxi University, Taiyuan 030006, P.R. China chenqq18@126.com, cfy@sxu.edu.cn, sxxying@126.com, ljy@sxu.edu.cn |
| Pseudocode | Yes | Algorithm 1: BDIS 1: Input: Training set Dtr = {(x1, y1), . . . , (xn, yn)}. 2: Parameters: Truncation thresholds τ1 and τ2. 3: Output: Reduced set R. 4: Employ the accelerated k-means algorithm to cluster Dtr into two sub-clusters; 5: for each sub-cluster do 6: if the labels of the data in the sub-cluster are the same then 7: Consider the sub-cluster an LHC and record the data within it and its cluster center; 8: else 9: Iteratively divide the sub-cluster until it is composed of one or more LHCs. 10: end if 11: end for 12: In each class of data, the LHCs with a number of instances between τ1 and τ2 are selected, and the instances closest to the centers of these LHCs are added to R. |
| Open Source Code | Yes | All code and data results are available at https://github.com/CQQXY161120/Instance-Selection. |
| Open Datasets | Yes | The comparisons are carried out on multiple synthetic and 12 benchmark data sets which are available at the UCI Repository (Dua and Graff 2017). |
| Dataset Splits | Yes | All experimental results are obtained through 10-fold cross-validation. |
| Hardware Specification | Yes | The experiments are conducted on an Intel i7-7700 CPU @ 3.60 GHz with 48 GB of RAM. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided. The paper mentions using "FAISS (Johnson, Douze, and Jégou 2017) to accelerate k-means clustering" but does not specify a version number for FAISS. |
| Experiment Setup | Yes | Therefore, in order to balance the number of instances and the generalization performance of the classifier, we empirically set k1 = 0 and k2 = 7 for subsequent experiments. |
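The selection loop quoted in Algorithm 1 can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: it replaces the paper's FAISS-accelerated k-means with a plain Lloyd's 2-means, the function names `two_means`, `find_lhcs`, and `bdis_select` are invented, the half-open interval `(tau1, tau2]` is an assumption about the quoted "between τ1 and τ2", and the full BDIS method additionally reasons about the remaining clusters via Bayesian decision theory, which is not reproduced here.

```python
# Hypothetical sketch of Algorithm 1 (BDIS): recursively bisect the training
# set into label-homogeneous clusters (LHCs), then keep one representative per
# LHC whose size falls in the truncation interval.
import numpy as np

def two_means(X, n_iter=20, seed=0):
    """Plain Lloyd's 2-means; the paper uses FAISS-accelerated k-means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for c in (0, 1):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    if len(np.unique(assign)) < 2:       # degenerate split (e.g. duplicates):
        assign = np.arange(len(X)) % 2   # fall back to an arbitrary bisection
    return assign

def find_lhcs(X, y, lhcs):
    """Recursively bisect (X, y) until every sub-cluster is label-homogeneous."""
    if len(np.unique(y)) == 1:
        lhcs.append((X, y[0]))           # record members and their shared label
        return
    assign = two_means(X)
    for c in (0, 1):
        find_lhcs(X[assign == c], y[assign == c], lhcs)

def bdis_select(X, y, tau1=0, tau2=7):
    """Per LHC with tau1 < size <= tau2, keep the member closest to its centroid."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    lhcs = []
    find_lhcs(X, y, lhcs)
    kept = [(m[np.linalg.norm(m - m.mean(axis=0), axis=1).argmin()], lab)
            for m, lab in lhcs if tau1 < len(m) <= tau2]
    if not kept:
        return np.empty((0, X.shape[1])), np.empty(0, dtype=y.dtype)
    keep_X, keep_y = zip(*kept)
    return np.array(keep_X), np.array(keep_y)
```

With the paper's reported setting (k1 = 0, k2 = 7, assumed here to play the role of τ1 and τ2), only small homogeneous clusters contribute a representative, which is what drives the reduction rate reported in Table 2.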