Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
MicroScholar: Mining Scholarly Information from Chinese Microblogs
Authors: Yang Yu, Xiaojun Wan
AAAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | experimental results demonstrate their usefulness. In order to evaluate the classification performance, we crawl several thousand microblog texts and manually annotate them into four types described above, then construct a balanced evaluation dataset of 2,142 microblog texts (592: 491: 514: 545 for the four categories) by sampling from the whole annotation corpus. We perform all SVM experiments in 10-fold cross validation. The evaluation results are shown in Table 1. |
| Researcher Affiliation | Academia | Institute of Computer Science and Technology, Peking University, Beijing 100871, China; The MOE Key Laboratory of Computational Linguistics, Peking University, Beijing 100871, China; EMAIL |
| Pseudocode | No | The paper does not contain any structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper references third-party tools like WEKA and Gibbs LDA++ with URLs, but it does not provide any concrete access to the source code for the 'MicroScholar' system or the specific methodology described in the paper. |
| Open Datasets | No | The paper describes the creation of its own evaluation dataset (2,142 microblog texts) and an unlabeled corpus (113,925 microblogs) by crawling and manual annotation. However, it does not provide a direct link, DOI, repository name, or formal citation for accessing these specific datasets created by the authors for their experiments. |
| Dataset Splits | Yes | We perform all SVM experiments in 10-fold cross validation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions 'WEKA toolbox' and 'Gibbs LDA++' but does not provide specific version numbers for these or any other software dependencies needed to replicate the experiment. |
| Experiment Setup | Yes | We utilize the popular SVM classifier for the categorization task and apply the SMO algorithm in WEKA toolbox for implementation. We apply Gibbs LDA++ for the LDA implementation, and the number of topics is set to 300. Figure 2 plots the performance values of SVM(T+D+LDA) with respect to different number of topics ranging from 100 to 500. |
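
The 10-fold cross-validation protocol quoted under Dataset Splits can be illustrated with a minimal sketch. It uses the class sizes reported for the evaluation set (592, 491, 514, 545) and only the Python standard library. The integer labels, the function names, the random seed, and the choice to stratify folds by class are all assumptions made for illustration; the paper ran its experiments through WEKA, and this is not a reproduction of that tooling.

```python
import random
from collections import Counter

# Class sizes reported for the paper's evaluation set (four categories, 2,142 texts).
CLASS_COUNTS = [592, 491, 514, 545]


def stratified_folds(labels, k=10, seed=0):
    """Assign each example index to one of k folds with similar class proportions.

    Stratification and the seed are assumptions of this sketch, not details
    taken from the paper.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so overall fold sizes differ by at most one.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


def train_test_splits(folds):
    """Yield one (train_indices, test_indices) pair per fold."""
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical integer labels 0..3 standing in for the four microblog categories.
labels = [c for c, n in enumerate(CLASS_COUNTS) for _ in range(n)]
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))

# Per-class counts in the first held-out fold, useful as a quick sanity check.
first_fold_classes = Counter(labels[i] for i in folds[0])
```

Each of the ten `(train, test)` pairs holds out roughly a tenth of the 2,142 texts; a classifier would be trained on `train` and scored on `test`, with the ten scores averaged.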