Semantic Lexicon Induction from Twitter with Pattern Relatedness and Flexible Term Length

Authors: Ashequl Qadir, Pablo Mendes, Daniel Gruhl, Neal Lewis

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | "We show that our approach is able to learn high quality semantic lexicons from informally written social media text of Twitter, and can achieve accuracy as high as 92% in the top 100 learned category members." Also cited: Table 3 ("Accuracy of the induced lexicons up to top 100 terms") and Figure 1 ("Lexicon Growth Rate Comparison").
Researcher Affiliation | Collaboration | Ashequl Qadir (University of Utah, 50 S Central Campus Drive, Salt Lake City, Utah 84112; asheq@cs.utah.edu) and Pablo N. Mendes, Daniel Gruhl, and Neal Lewis (IBM Research Almaden, 650 Harry Road, San Jose, California 95120; {pnmendes, dgruhl, nrlewis}@us.ibm.com)
Pseudocode | No | The paper describes the methodology in narrative text and does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | "For this research, we collected 114 million English tweets published in Twitter during February and March, 2013, using Twitter 10% decahose stream." The paper describes collecting its own data from Twitter's stream, which is not the same as providing a publicly available dataset with a link or citation.
Dataset Splits | No | The paper describes using a corpus of tweets for lexicon induction and sampling for pattern pools, but it does not specify explicit training, validation, or test splits for model evaluation.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using DISCO, BASILISK, and an in-house tokenizer, but does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | "For each seed term s ∈ S_C for semantic category C, we first extract all N-gram context patterns containing up to 6 words, and store them in our pattern pool P_C." "We then remove any p ∈ P_C that has a confidence threshold lower than 10⁻⁶ to limit the initial pattern space." "We then take the top 2000 candidates as our initial set of candidates T_C for category C." "We use the average term boundary score for all t ∈ T_C to keep only the candidates that have a TBS greater than the average." "Then we rank the patterns by this score in descending order, and keep only the top 20% of the patterns ranked by the score."
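
Taken together, these quotes describe a five-step filter-and-rank pipeline: pattern extraction, a confidence cutoff, a top-2000 candidate cut, an above-average term boundary score (TBS) filter, and a top-20% pattern cut. The Python sketch below restates that control flow only as a readability aid; the scoring functions in it (pattern_confidence, candidate_score, term_boundary_score, pattern_score) and the extract_patterns/match_candidates callables are hypothetical placeholders, not the paper's formulas.

```python
# Minimal sketch of the five-step pipeline quoted in the Experiment
# Setup row. Only the numeric thresholds (6-word patterns, 1e-6
# confidence cutoff, top-2000 candidates, above-average TBS, top-20%
# patterns) come from the paper; everything else is a placeholder.

def pattern_confidence(pattern):
    # Placeholder: the paper drops patterns below a 1e-6 confidence
    # threshold, but its confidence formula is not reproduced here.
    return 1.0

def candidate_score(term):
    # Placeholder for the score used to pick the initial top 2000 terms.
    return 1.0

def term_boundary_score(term):
    # Placeholder for the paper's Term Boundary Score (TBS).
    return 1.0

def pattern_score(pattern):
    # Placeholder for the pattern-relatedness score used in step 5.
    return 1.0

def induce_category(seeds, extract_patterns, match_candidates):
    """Run one induction pass for a single semantic category C.

    extract_patterns(seed, max_words) and match_candidates(pattern)
    are caller-supplied; the paper does not give enough detail in
    this report's quotes to reconstruct them.
    """
    # Step 1: build the pattern pool P_C from n-gram context patterns
    # of up to 6 words around each seed term.
    pool = set()
    for seed in seeds:
        pool.update(extract_patterns(seed, max_words=6))

    # Step 2: drop patterns whose confidence falls below 1e-6.
    pool = {p for p in pool if pattern_confidence(p) >= 1e-6}

    # Step 3: collect terms matched by the surviving patterns and keep
    # the top 2000 as the initial candidate set T_C.
    terms = {t for p in pool for t in match_candidates(p)}
    candidates = sorted(terms, key=candidate_score, reverse=True)[:2000]

    # Step 4: keep only candidates with an above-average TBS.
    tbs = {t: term_boundary_score(t) for t in candidates}
    avg_tbs = sum(tbs.values()) / len(tbs) if tbs else 0.0
    candidates = [t for t in candidates if tbs[t] > avg_tbs]

    # Step 5: re-rank the patterns and retain only the top 20%.
    ranked = sorted(pool, key=pattern_score, reverse=True)
    pool = ranked[:len(ranked) // 5]

    return candidates, pool
```

With all scores constant, the sketch demonstrates control flow only; the real scoring functions are exactly what the paper leaves underspecified, consistent with the "Pseudocode: No" and "Open Source Code: No" rows above.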