On Conceptual Labeling of a Bag of Words
Authors: Xiangyan Sun, Yanghua Xiao, Haixun Wang, Wei Wang
IJCAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on both synthetic data and real data. We also present case studies to verify the rationality of our approach. |
| Researcher Affiliation | Collaboration | School of Computer Science, Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China; Google Research, USA |
| Pseudocode | No | The paper describes the search strategy using prose in Section 3.6, but it does not present it in a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper states 'Probase data is available at http://probase.msra.cn/dataset.aspx', which refers to a dataset used, not the open-source code for the methodology described in the paper. No explicit statement about making the authors' code available is found. |
| Open Datasets | Yes | In this paper, we use Probase to provide us fine-grained concepts and their statistics. Probase is acquired from 1.68 billion web pages. It extracts isA relations from sentences matching Hearst patterns [Hearst, 1992]. The core version of Probase contains 3,024,814 unique concepts, 6,768,623 unique instances, and 29,625,920 isA relations. Probase data is available at http://probase.msra.cn/dataset.aspx |
| Dataset Splits | No | The paper mentions generating 't = 1000 bags of words for evaluation' for synthetic data and manually evaluating '100 test cases randomly selected' from real data, but it does not provide specific train/validation/test dataset split percentages, absolute sample counts for each split, or detailed splitting methodology. |
| Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., CPU or GPU models, memory, or cloud computing infrastructure details). |
| Software Dependencies | No | The paper mentions using 'LDA [Blei et al., 2003]' but does not provide specific version numbers for this or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | Section 3.5 introduces an additional parameter α to adjust the tradeoff between coverage and minimality. By default α = 0.5. A larger α means the description length of the concepts is weighted more heavily than that of the input words, so fewer concepts are generated, and vice versa. Section 4.1 describes varying the parameters `nc`, `ni`, and `nn` to guide the generation of synthetic data. |
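
The effect of α described in the Experiment Setup row can be illustrated with a toy sketch. This is not the paper's objective: the convex-combination form, the function names `weighted_cost` and `best_labeling`, and the bit counts below are all assumptions chosen only to show how weighting concept description length more heavily pushes the choice toward fewer concepts.

```python
# Toy illustration only: the cost form and all numbers are hypothetical,
# not taken from the paper.

def weighted_cost(concept_bits, word_bits, alpha=0.5):
    """Assumed two-part cost: alpha weights the bits spent describing the
    concepts, (1 - alpha) weights the bits spent describing the input words."""
    return alpha * concept_bits + (1 - alpha) * word_bits

# Hypothetical candidates: name -> (concept bits, word bits given concepts).
# More concepts cost more to describe but explain the words more cheaply.
CANDIDATES = {
    "one_concept": (8.0, 20.0),
    "two_concepts": (16.0, 10.0),
    "three_concepts": (24.0, 6.0),
}

def best_labeling(alpha):
    """Return the candidate name with the lowest weighted cost for this alpha."""
    return min(CANDIDATES, key=lambda name: weighted_cost(*CANDIDATES[name], alpha))

if __name__ == "__main__":
    for alpha in (0.2, 0.5, 0.8):
        print(alpha, best_labeling(alpha))
```

With these made-up numbers, a small α selects the three-concept candidate, the default α = 0.5 selects two concepts, and a large α selects a single concept, matching the direction of the tradeoff stated in the table.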