Unsupervised Phrasal Near-Synonym Generation from Text Corpora

Authors: Dishan Gupta, Jaime Carbonell, Anatole Gershman, Steve Klein, David Miller

AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | An evaluation framework with crowd-sourced judgments is proposed and results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases. The Gigaword Corpus: We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.
Researcher Affiliation | Collaboration | Dishan Gupta, Jaime Carbonell, Anatole Gershman (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA; dishang@cs.cmu.edu, jgc@cs.cmu.edu, anatole.gershman@gmail.com); Steve Klein, David Miller (Meaningful Machines, LLC; steve@applecoreholdings.com, dave@applecoreholdings.com)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper states 'The query phrases, annotations and other results can be downloaded at http://www.cs.cmu.edu/~dishang/', which refers to data and results, not the source code for the methodology described in the paper.
Open Datasets | Yes | The Gigaword Corpus: We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.
Dataset Splits | No | The paper mentions splitting the corpus for processing ('The corpus was split into 32 equal parts', 'We used 37.5% of the data'), but it does not provide explicit training, validation, and test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper mentions 'Since, the server hardware can support up to 32 (16x2) threads in parallel' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software such as Hadoop but does not provide specific ancillary software details, such as library or solver names with version numbers.
Experiment Setup | Yes | Parameter Training: Equation 1 contains four weighting parameters, each instantiated separately for three similarity components, along with the cradle boosting parameter, for a total of 13 parameters. One possible parameter training scheme is to generate training data consisting of query phrases and to pick near-synonym candidates rated as highly synonymous by human judges. A natural optimization objective would then maximize the score of these rated candidates, with the constraint that all parameters are > 0. The score is a product of two nonnegative convex functions and is therefore convex; this makes the optimization objective a difference of two convex functions (DC class), and its direct optimization is reserved for future work. For the present we relied on multi-start coordinate ascent with binary search instead of a linear step-size increase. The parameters were trained on a set of 30 query phrases, separate from the ones used in the evaluation (see section Experiments).
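
A minimal sketch of the training procedure quoted in the Experiment Setup row: multi-start coordinate ascent with a binary search on the step size, over 13 parameters constrained to be positive. The objective function below is a toy stand-in (an assumption, not the paper's actual rating-based objective over the 30 training query phrases), and all function names are illustrative.

```python
import random

N_PARAMS = 13      # 13 parameters in total, including the cradle-boosting parameter
MIN_VALUE = 1e-6   # keep every parameter strictly positive

def score(params):
    """Toy stand-in objective (assumption): a smooth concave function peaking
    at 0.5 in every coordinate. Replace with the objective computed from the
    human-rated near-synonym candidates of the training query phrases."""
    return -sum((p - 0.5) ** 2 for p in params)

def binary_search_step(params, i, direction, hi=1.0, iters=20):
    """Bisect on the step size along coordinate i (in +/- direction),
    keeping the best improving step found."""
    lo = 0.0
    best_step, best_val = 0.0, score(params)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        trial = list(params)
        trial[i] = max(MIN_VALUE, trial[i] + direction * mid)
        val = score(trial)
        if val > best_val:
            best_step, best_val = mid, val
            lo = mid   # still improving: try larger steps
        else:
            hi = mid   # no improvement: shrink the interval
    return best_step, best_val

def coordinate_ascent(start, sweeps=100):
    """One run of coordinate ascent from a given starting point,
    enforcing the positivity constraint on every parameter."""
    params, best = list(start), score(start)
    for _ in range(sweeps):
        improved = False
        for i in range(N_PARAMS):
            for direction in (1.0, -1.0):
                step, val = binary_search_step(params, i, direction)
                if step > 0.0 and val > best:
                    params[i] = max(MIN_VALUE, params[i] + direction * step)
                    best, improved = val, True
        if not improved:
            break
    return params, best

def multi_start_training(n_starts=10, seed=0):
    """Multi-start wrapper: run coordinate ascent from random positive
    starting points and keep the best-scoring parameter vector."""
    rng = random.Random(seed)
    runs = [coordinate_ascent([rng.uniform(0.1, 1.0) for _ in range(N_PARAMS)])
            for _ in range(n_starts)]
    return max(runs, key=lambda r: r[1])

if __name__ == "__main__":
    params, value = multi_start_training()
    print("best objective:", round(value, 6))
```

The bisection replaces a fixed linear step schedule: each coordinate update halves the step-size interval, accepting only steps that improve the objective, and the multi-start wrapper guards against poor local optima from a single starting point.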