Open Domain Short Text Conceptualization: A Generative + Descriptive Modeling Approach
Authors: Yangqiu Song, Shusen Wang, Haixun Wang
IJCAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we show experiments on two short text data sets to compare our method with existing conceptualization methods. News Title: We extract news titles from a news corpus containing about one million articles searched from Web pages. The news articles have been classified into topics. We select six topics, i.e., company, disease, entertainment, food, politician, and sports, to evaluate different approaches. We randomly select 3,000 news articles in each topic, and only keep the title field. We call this data set the News Title Data Set. The average word count of the 18,000 news titles is 7.96. Twitter: In this data set, the 4,542 tweets are in three categories: company (1,205), country (1,747), and device (1,590). The average length of the tweets is 13.36 words. |
| Researcher Affiliation | Collaboration | (a) University of Illinois at Urbana-Champaign; (b) Zhejiang University; (c) Google Research |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions data availability for Probase, an external knowledge base used in the work, but does not provide access to the authors' own implementation code for the described methodology. |
| Open Datasets | No | The paper describes the characteristics of the "News Title Data Set" and "Twitter" data set, and states they were extracted or composed by the authors, but does not provide specific access information (e.g., link, DOI, or formal citation to a public repository) for these exact datasets as used in the experiments. It cites Probase and Wikipedia as knowledge bases but these are not the experimental evaluation datasets. |
| Dataset Splits | No | The paper describes the datasets and some selection criteria (e.g., "randomly select 3,000 news articles"), but does not specify train, validation, or test splits for the experimental evaluation. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to conduct the experiments. |
| Software Dependencies | No | The paper mentions using "Mallet [McCallum, 2002]" for LDA, but does not provide a version number for Mallet or any other software dependencies, which would be necessary for reproducibility. |
| Experiment Setup | Yes | We set the topic number to be the cluster number or twice the cluster number and report the better of the two; this method is denoted as LDA #1. The topic number is set to be 10 or 20, and we report the better of the two; this method is denoted as LDA #2. The top 1,000, 2,000, and 10,000 concepts are used as features for clustering, and we report the best. The top 100, 200, and 400 concepts are used for clustering respectively, and we report the best. We compute the concept distribution c for each text, and use the top 400 concepts in the clustering experiments. |
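The LDA #1 baseline quoted above (fit LDA with the topic number set to the cluster number K and to 2K, cluster the resulting topic distributions, report the better run) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper used Mallet on news titles and tweets, whereas this sketch uses scikit-learn on toy documents with made-up gold labels, and assumes the "better of the two" is judged by NMI against those labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Toy stand-ins for the paper's short texts and topic labels.
docs = [
    "apple releases new phone", "google announces search update",
    "flu outbreak hits city", "new vaccine trial for measles",
    "team wins championship game", "star player signs contract",
]
labels = [0, 0, 1, 1, 2, 2]  # hypothetical gold clusters
K = 3                        # number of gold clusters

X = CountVectorizer().fit_transform(docs)

def lda_cluster_nmi(n_topics):
    """Fit LDA with n_topics, k-means-cluster the doc-topic vectors,
    and score the clustering against the gold labels with NMI."""
    theta = LatentDirichletAllocation(
        n_components=n_topics, random_state=0).fit_transform(X)
    pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(theta)
    return normalized_mutual_info_score(labels, pred)

# LDA #1: topic number = K or 2K, "report the better of the two".
best = max(lda_cluster_nmi(K), lda_cluster_nmi(2 * K))
print(f"best NMI: {best:.3f}")
```

The same pattern covers LDA #2 by replacing the candidate topic numbers with {10, 20}.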