Streaming Classification with Emerging New Class by Class Matrix Sketching
Authors: Xin Mu, Feida Zhu, Juan Du, Ee-Peng Lim, Zhi-Hua Zhou
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical evaluation shows the proposed method not only achieves comparable performance but also strengthens modelling on large-scale data sets. The authors conduct empirical studies on benchmark data sets and a real-world news-topic data set to validate the effectiveness and efficiency of the approach, with experiments on both simulated and real-world streams. Experimental setup: three benchmark data sets are used to assess the performance of all methods (KDDCup99, Forest Cover, MNIST). Results for the simulated stream are shown in Figure 5 and Table 1. |
| Researcher Affiliation | Academia | Xin Mu,1,2 Feida Zhu,3 Juan Du,3 Ee-Peng Lim,3 Zhi-Hua Zhou1,2 1National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China 2Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210023, China 3School of Information Systems, Singapore Management University, Singapore, 178902 {mux, zhouzh}@lamda.nju.edu.cn, {fdzhu, juandu, eplim}@smu.edu.sg |
| Pseudocode | Yes | Algorithm 1: Initialize Class Matrix Sketching; Algorithm 2: Deploying SENC-MaS in data stream; Algorithm 3: Update |
| Open Source Code | No | The paper states "LACU-SVM and iForest were the codes as released by the corresponding authors" for the comparison methods, but it provides no explicit statement or link for the source code of its own proposed method (SENC-MaS). |
| Open Datasets | Yes | Three benchmark data sets are used to assess the performance of all methods: KDDCup99, Forest Cover, and MNIST. In addition, a real news summary stream is used to evaluate performance; it is crawled over a period of time using the New York Times API. Each item is preprocessed using the word2vec technique to produce a 1000-dimension feature vector. |
| Dataset Splits | No | The paper states "An initial training set with two known classes is available to train the model" and "The data size of the initial training set D is 2000 per class." and describes how a "threshold" is determined using training instances. However, it does not specify explicit numerical train/validation/test splits (e.g., percentages or counts) for the continuous data stream or mention cross-validation. |
| Hardware Specification | No | The paper states "All methods are executed in the MATLAB environment" but provides no specific details about the hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions "MATLAB environment" but does not specify a version. It also mentions "word2vec technique" and links to "gensim" but does not provide a version number for gensim or any other specific library/solver used in their implementation. |
| Experiment Setup | Yes | The number of trees in iForest is set to 50 and ψ = 200. Parameters in LACU-SVM are set to ramps = 0.3, η = 1.3, λ = 0.1, max iter = 10, following the authors' paper. ECSMiner employs K-means with K set to 5. In SAND-F, the ensemble size t is set to 6, q = 50, and τ = 0.4. In SENC-MaS, the buffer size is s = 3000, L = N^0.8, l_i = n_i^0.8. The data size of the initial training set D is 2000 per class. |
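The matrix-sketching step named in Algorithm 1 ("Initialize Class Matrix Sketching") can be illustrated with Frequent Directions, a standard streaming matrix-sketching technique. This is an illustrative sketch only; the paper's class-matrix sketching variant may differ in its shrinkage rule and per-class bookkeeping.

```python
import numpy as np

def frequent_directions(stream_rows, sketch_size):
    """Illustrative Frequent Directions sketch (Liberty, 2013) -- NOT the
    paper's exact SENC-MaS update. Maintains a small matrix B of shape
    (sketch_size, d) such that B^T B approximates A^T A for the rows of A
    seen so far, processing one row at a time as in a data stream."""
    d = len(stream_rows[0])
    B = np.zeros((sketch_size, d))
    for row in stream_rows:
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # Sketch is full: shrink all rows via SVD so at least half
            # of them become zero, making room for new stream rows.
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[sketch_size // 2] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row  # insert the new row into a freed slot
    return B
```

In a per-class setting, one such sketch would be kept for each known class and updated as labeled instances arrive, which is the general idea behind sketching class matrices in a stream.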