Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
**Streaming Classification with Emerging New Class by Class Matrix Sketching**
Authors: Xin Mu, Feida Zhu, Juan Du, Ee-Peng Lim, Zhi-Hua Zhou
AAAI 2017 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "The empirical evaluation shows the proposed method not only receives the comparable performance but also strengthens modelling on large-scale data sets." "We conduct empirical studies on the benchmark data sets and a real-world topic of news data set to validate the effectiveness and efficiency of our approach." "We conduct experiments on both simulated and real-world streams to comprehensively evaluate the performance." From the Experiment section: "Three benchmark data sets are used to assess the performance of all methods, including KDDCup99, Forest Cover, MNIST." "The results of simulated stream are shown in Figure 5 and Table 1." |
| Researcher Affiliation | Academia | Xin Mu (1,2), Feida Zhu (3), Juan Du (3), Ee-Peng Lim (3), Zhi-Hua Zhou (1,2). 1: National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; 2: Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing, 210023, China; 3: School of Information Systems, Singapore Management University, Singapore, 178902. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Initialize Class Matrix Sketching; Algorithm 2: Deploying SENC-MaS in data stream; Algorithm 3: Update |
| Open Source Code | No | The paper states "LACU-SVM and iForest were the codes as released by the corresponding authors" for comparison methods, but it does not provide an explicit statement or link for the source code of its own proposed method (SENC-MaS). |
| Open Datasets | Yes | Three benchmark data sets are used to assess the performance of all methods, including KDDCup99, Forest Cover, MNIST. In addition, a real news summary stream is used to evaluate performance. It is crawled over a period of time by using the New York Times API. Each item is preprocessed using the word2vec technique to produce a 1000-dimension feature vector. |
| Dataset Splits | No | The paper states "An initial training set with two known classes is available to train the model" and "The data size of the initial training set D is 2000 per class." and describes how a "threshold" is determined using training instances. However, it does not specify explicit numerical train/validation/test splits (e.g., percentages or counts) for the continuous data stream or mention cross-validation. |
| Hardware Specification | No | The paper states "All methods are executed in the MATLAB environment" but provides no specific details about the hardware (e.g., CPU, GPU models, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions "MATLAB environment" but does not specify a version. It also mentions "word2vec technique" and links to "gensim" but does not provide a version number for gensim or any other specific library/solver used in their implementation. |
| Experiment Setup | Yes | Number of trees in iForest is set to 50 and ψ = 200. Parameters in LACU-SVM are set to ramps = 0.3, η = 1.3, λ = 0.1, max iter = 10 according to the authors' paper. ECSMiner employs K-means and K is set to 5. In SAND-F, ensemble size t is set to 6, q = 50 and τ = 0.4. In SENC-MaS, the buffer is of size s = 3000, L = N^0.8, l_i = n_i^0.8. The data size of the initial training set D is 2000 per class. |
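Since the paper's own SENC-MaS implementation is not released (see the Open Source Code row), a reader wanting to approximate the class-matrix-sketching component would have to reimplement it. The following is a minimal numpy sketch of Frequent Directions (Liberty, 2013), the standard deterministic matrix-sketching primitive on which class-matrix sketches of this kind are typically built; the function name and parameters are ours for illustration, not the paper's, and this is not the authors' exact algorithm.

```python
import numpy as np

def frequent_directions(A, ell):
    """Maintain an ell-row sketch B of the rows of A, streamed one at a time.

    Guarantee (Frequent Directions): ||A^T A - B^T B||_2 <= 2 * ||A||_F^2 / ell.
    """
    n, d = A.shape
    assert ell <= d, "sketch size must not exceed the feature dimension"
    B = np.zeros((ell, d))
    for row in A:
        # Find a zero row of B to receive the incoming stream row.
        zero_rows = np.where(~B.any(axis=1))[0]
        if len(zero_rows) == 0:
            # Sketch is full: shrink all singular values by the smallest one,
            # which zeroes out at least the last row of the sketch.
            U, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[-1] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s[:, None] * Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        B[zero_rows[0]] = row
    return B
```

Under the paper's reported setup, the per-class sketch size would be chosen as l_i = n_i^0.8 for a class with n_i instances, e.g. `ell = int(n_i ** 0.8)`.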