On Conceptual Labeling of a Bag of Words

Authors: Xiangyan Sun, Yanghua Xiao, Haixun Wang, Wei Wang

IJCAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on both synthetic data and real data. We also present case studies to verify the rationality of our approach.
Researcher Affiliation | Collaboration | School of Computer Science, Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China; Google Research, USA
Pseudocode | No | The paper describes the search strategy in prose in Section 3.6, but does not present it as structured pseudocode or an algorithm block.
Open Source Code | No | The paper states 'Probase data is available at http://probase.msra.cn/dataset.aspx', which refers to a dataset used, not to open-source code for the methodology described in the paper. No explicit statement about making the authors' code available is found.
Open Datasets | Yes | In this paper, we use Probase to provide us fine-grained concepts and their statistics. Probase is acquired from 1.68 billion web pages. It extracts isA relations from sentences matching Hearst patterns [Hearst, 1992]. The core version of Probase contains 3,024,814 unique concepts, 6,768,623 unique instances, and 29,625,920 isA relations. Probase data is available at http://probase.msra.cn/dataset.aspx
Dataset Splits | No | The paper mentions generating 't = 1000 bags of words for evaluation' for synthetic data and manually evaluating '100 test cases randomly selected' from real data, but it does not provide train/validation/test split percentages, absolute sample counts for each split, or a detailed splitting methodology.
Hardware Specification | No | The paper does not specify the hardware used for running experiments (e.g., CPU or GPU models, memory, or cloud computing infrastructure details).
Software Dependencies | No | The paper mentions using 'LDA [Blei et al., 2003]' but does not provide version numbers for this or any other software dependency used in the experiments.
Experiment Setup | Yes | In Section 3.5, we introduce an additional parameter α to adjust the tradeoff between coverage and minimality. By default α = 0.5. A larger α value indicates that the description length of concepts is weighted higher than that of input words, thus fewer concepts will be generated, and vice versa. Section 4.1 describes varying the parameters `nc`, `ni`, and `nn` to guide the generation of synthetic data.
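
The α tradeoff described in the Experiment Setup row can be illustrated with a small sketch. This is not the paper's objective or its search strategy: the weighted sum `alpha * concept_bits + (1 - alpha) * word_bits`, the toy probabilities, the `UNCOVERED_BITS` penalty, and the brute-force subset search are all assumptions made for illustration. The paper's actual formulation is in its Sections 3.5 and 3.6 and may differ.

```python
import math
from itertools import combinations

# Hypothetical toy statistics (NOT Probase values). Powers of two keep
# the code lengths integral and easy to check by hand.
P_CONCEPT = {"fruit": 1 / 16, "company": 1 / 16, "thing": 1 / 8}
P_WORD_GIVEN_CONCEPT = {
    "fruit": {"apple": 1 / 2, "banana": 1 / 2},
    "company": {"google": 1 / 2, "microsoft": 1 / 2},
    "thing": {"apple": 1 / 16, "banana": 1 / 16,
              "google": 1 / 16, "microsoft": 1 / 16},
}
UNCOVERED_BITS = 10.0  # assumed cost of a word that no chosen concept covers


def bits(p):
    """Code length in bits of an event with probability p."""
    return -math.log2(p)


def weighted_length(concepts, words, alpha):
    """Assumed objective: alpha weights the concepts, (1 - alpha) the words."""
    concept_bits = sum(bits(P_CONCEPT[c]) for c in concepts)
    word_bits = 0.0
    for w in words:
        # Encode each word with the cheapest chosen concept that covers it.
        options = [bits(P_WORD_GIVEN_CONCEPT[c][w])
                   for c in concepts if w in P_WORD_GIVEN_CONCEPT[c]]
        word_bits += min(options, default=UNCOVERED_BITS)
    return alpha * concept_bits + (1 - alpha) * word_bits


def best_labels(words, alpha):
    """Brute-force search over all concept subsets (toy scale only)."""
    candidates = sorted(P_CONCEPT)
    subsets = (set(s) for r in range(len(candidates) + 1)
               for s in combinations(candidates, r))
    return min(subsets, key=lambda s: weighted_length(s, words, alpha))


bag = ["apple", "banana", "google", "microsoft"]
labels_default = best_labels(bag, alpha=0.5)  # two specific concepts
labels_high = best_labels(bag, alpha=0.8)     # one generic concept
```

With these toy numbers, α = 0.5 selects the two specific concepts `fruit` and `company`, while α = 0.8 selects the single generic concept `thing`. This matches the qualitative behavior stated above, where a larger α yields fewer concepts.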