Unsupervised Phrasal Near-Synonym Generation from Text Corpora
Authors: Dishan Gupta, Jaime Carbonell, Anatole Gershman, Steve Klein, David Miller
AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 'An evaluation framework with crowd-sourced judgments is proposed and results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases.' From the corpus description: 'We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.' |
| Researcher Affiliation | Collaboration | Dishan Gupta, Jaime Carbonell, Anatole Gershman (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA; dishang@cs.cmu.edu, jgc@cs.cmu.edu, anatole.gershman@gmail.com) and Steve Klein, David Miller (Meaningful Machines, LLC; steve@applecoreholdings.com, dave@applecoreholdings.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper states 'The query phrases, annotations and other results can be downloaded at http://www.cs.cmu.edu/~dishang/' which refers to data and results, not the source code for the methodology described in the paper. |
| Open Datasets | Yes | From the paper's corpus description: 'We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.' |
| Dataset Splits | No | The paper mentions splitting the corpus for processing ('The corpus was split into 32 equal parts', 'We used 37.5% of the data'), but it does not provide explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions 'Since, the server hardware can support up to 32 (16x2) threads in parallel' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software like 'Hadoop' but does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | From the Parameter Training section: Equation 1 contains four weighting parameters, each instantiated separately for three context configurations, along with the cradle boosting parameter, for a total of 13 parameters (4 × 3 + 1). One possible parameter training scheme is to generate training data consisting of query phrases and to pick near-synonym candidates rated as highly synonymous by human judges. A natural optimization objective follows, with the constraint that all the parameters are > 0. The scoring function is a product of two nonnegative convex functions and is therefore convex, which makes the optimization objective a difference of two convex functions (DC class); its direct optimization is reserved for future work. For the present the authors relied on multi-start coordinate ascent with binary search in place of a linearly increasing step size. The parameters were trained on a set of 30 query phrases, separate from the ones used in the evaluation (see section Experiments). |
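
The training procedure quoted above is concrete enough to sketch. Below is a minimal, hypothetical Python sketch of multi-start coordinate ascent with a halving (binary) step-size search under positivity constraints; `toy_objective`, the start ranges, the tolerances, and the bounds are all our assumptions, standing in for the paper's actual 13-parameter DC objective.

```python
import random

def coordinate_ascent(f, x0, lower=1e-6, step0=1.0, tol=1e-6, sweeps=100):
    """Maximize f by cycling over coordinates; the step along each
    coordinate is found by halving (binary search) rather than a
    linear step-size schedule. All coordinates are kept > 0, matching
    the paper's constraint that every parameter be positive."""
    x = list(x0)
    for _ in range(sweeps):
        improved = False
        for i in range(len(x)):
            step = step0
            while step > tol:
                for cand in (x[i] + step, x[i] - step):
                    if cand > lower:
                        trial = x[:]
                        trial[i] = cand
                        if f(trial) > f(x):
                            x = trial
                            improved = True
                            break
                else:
                    step /= 2.0  # no improvement at this scale: halve
        if not improved:
            return x  # a full sweep made no progress: local optimum
    return x

def multi_start(f, dim, starts=20, seed=0):
    """Restart coordinate ascent from several random positive points
    and keep the best local optimum (the 'multi-start' part)."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        x0 = [rng.uniform(0.1, 2.0) for _ in range(dim)]
        x = coordinate_ascent(f, x0)
        if best is None or f(x) > f(best):
            best = x
    return best

# Toy concave objective standing in for the paper's DC objective over
# 13 parameters; it exists only so the sketch runs end to end.
def toy_objective(params):
    return -sum((p - 1.0) ** 2 for p in params)

if __name__ == "__main__":
    best = multi_start(toy_objective, dim=13)
    print([round(p, 3) for p in best])  # should approach all ones
```

The multi-start wrapper matters because a DC objective is generally non-concave, so any single coordinate-ascent run can stall at a local optimum; restarting from several random positive points and keeping the best result is the standard mitigation.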