Bounding Uncertainty for Active Batch Selection
Authors: Hanmo Wang, Runwu Zhou, Yi-Dong Shen
AAAI 2019, pp. 5240-5247
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on fifteen datasets indicate that our method has significantly higher classification accuracy on testing data than the latest state-of-the-art BMAL methods, and also scales better even when the size of the unlabeled pool reaches 10^6. |
| Researcher Affiliation | Academia | State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, Beijing 100049, China. {wanghm,zhourw,ydshen}@ios.ac.cn |
| Pseudocode | Yes | Algorithm 1: Rand Greedy(U, b); Algorithm 2: BMAL based on LBC |
| Open Source Code | No | No explicit statement or link is provided for the open-source code of the methodology described in the paper. |
| Open Datasets | Yes | We use fifteen benchmark datasets, seven of which are from UCI machine learning repository (Dheeru and Karra Taniskidou 2017), namely segmentation, waveform, twonorm, HIGGS, covtype, SUSY and letter. The other eight datasets are Reuters, RCV1, TDT2, 20News, WEBACE, ORL, COIL20 and USPS, which are publicly available2. (footnote 2: http://www.cad.zju.edu.cn/home/dengcai/) |
| Dataset Splits | No | No validation split information (percentages, counts, or predefined splits) is provided. The paper only mentions splitting each dataset into unlabeled (60%) and testing (40%) data. |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions 'Logistic Regression is used as the classifier' and 'Gaussian kernel' and other methods, but no specific version numbers for any software components or libraries are provided. |
| Experiment Setup | Yes | The batch size b is fixed to 100 on the large datasets covtype, SUSY and HIGGS, 50 on letter and 20News, and 10 on the other small datasets. Logistic Regression is used as the classifier. For each dataset, the experiment is conducted 10 times and the averaged result is reported. A Gaussian kernel is used on all datasets: for data instances x and y, K(x, y) = exp(-\|\|x - y\|\|^2 / p), where the parameter p is the median of all pairwise squared Euclidean distances over the unlabeled data. All unlabeled samples are sorted in increasing order of their certainty in Eq. (10), and the hyper-parameter ε is set to the β-th percentile (0 < β < 100). Two hyper-parameters γ and τ empirically describe β as β = γ · (n_u/n)^τ, where γ and τ are fixed to 20 and 10 respectively. For the hyper-parameter λ, λ = b^2 is used. |
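The kernel and threshold settings in the experiment-setup row can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the function names (`gaussian_kernel`, `certainty_threshold`) are assumptions, and the per-sample certainty scores from the paper's Eq. (10) are taken as a given input array.

```python
import numpy as np

def gaussian_kernel(X):
    """Gaussian kernel K(x, y) = exp(-||x - y||^2 / p), with p set by the
    median heuristic: the median of all pairwise squared Euclidean
    distances over the (unlabeled) data, as described in the paper."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative round-off
    # Median over the strictly upper triangle (all distinct pairs).
    p = np.median(d2[np.triu_indices_from(d2, k=1)])
    return np.exp(-d2 / p)

def certainty_threshold(certainty, n_unlabeled, n_total, gamma=20.0, tau=10.0):
    """Set epsilon to the beta-th percentile of the certainty scores,
    with beta = gamma * (n_u / n)^tau (gamma=20, tau=10 in the paper)."""
    beta = gamma * (n_unlabeled / n_total) ** tau  # 0 < beta < 100
    return np.percentile(certainty, beta)
```

A typical call would pass the unlabeled pool to `gaussian_kernel` and the Eq. (10) certainty scores of those same samples to `certainty_threshold`; note that with τ = 10 the percentile β shrinks rapidly as the labeled fraction grows.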