Unsupervised Phrasal Near-Synonym Generation from Text Corpora
Authors: Dishan Gupta, Jaime Carbonell, Anatole Gershman, Steve Klein, David Miller
AAAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 'An evaluation framework with crowd-sourced judgments is proposed and results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases.' From the corpus description: 'We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.' |
| Researcher Affiliation | Collaboration | Dishan Gupta, Jaime Carbonell, Anatole Gershman (Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213 USA; dishang@cs.cmu.edu, jgc@cs.cmu.edu, anatole.gershman@gmail.com) and Steve Klein, David Miller (Meaningful Machines, LLC; steve@applecoreholdings.com, dave@applecoreholdings.com) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | No | The paper states 'The query phrases, annotations and other results can be downloaded at http://www.cs.cmu.edu/~dishang/' which refers to data and results, not the source code for the methodology described in the paper. |
| Open Datasets | Yes | From the paper's corpus description: 'We selected the very large English Gigaword Fifth Edition (Parker et al. 2011), a comprehensive archive of newswire text data, for our experiments.' |
| Dataset Splits | No | The paper mentions splitting the corpus for processing ('The corpus was split into 32 equal parts', 'We used 37.5% of the data'), but it does not provide explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper mentions 'Since, the server hardware can support up to 32 (16x2) threads in parallel' but does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software like 'Hadoop' but does not provide specific ancillary software details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | From the Parameter Training section: Equation 1 contains four weighting parameters, each instantiated separately for three context configurations, along with the cradle boosting parameter, for a total of 13 parameters (4 × 3 + 1). One possible parameter training scheme is to generate training data consisting of query phrases and to pick near-synonym candidates rated as highly synonymous by human judges. A natural optimization objective follows, with the constraint that all the parameters are > 0. The scoring function is a product of two nonnegative convex functions and is therefore convex, which makes the optimization objective a difference of two convex functions (DC class); its direct optimization is reserved for future work. For the present the authors relied on multi-start coordinate ascent with binary search in place of a linearly increasing step size. The parameters were trained on a set of 30 query phrases, separate from the ones used in the evaluation (see section Experiments). |
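
The training procedure quoted above is concrete enough to sketch. Below is a minimal, hypothetical Python sketch of multi-start coordinate ascent with a halving (binary) step-size search under positivity constraints; `toy_objective`, the start ranges, the tolerances, and the bounds are all our assumptions, standing in for the paper's actual 13-parameter DC objective.

```python
import random

def coordinate_ascent(f, x0, lower=1e-6, step0=1.0, tol=1e-6, sweeps=100):
    """Maximize f by cycling over coordinates; the step along each
    coordinate is found by halving (binary search) rather than a
    linear step-size schedule. All coordinates are kept > 0, matching
    the paper's constraint that every parameter be positive."""
    x = list(x0)
    for _ in range(sweeps):
        improved = False
        for i in range(len(x)):
            step = step0
            while step > tol:
                for cand in (x[i] + step, x[i] - step):
                    if cand > lower:
                        trial = x[:]
                        trial[i] = cand
                        if f(trial) > f(x):
                            x = trial
                            improved = True
                            break
                else:
                    step /= 2.0  # no improvement at this scale: halve
        if not improved:
            return x  # a full sweep made no progress: local optimum
    return x

def multi_start(f, dim, starts=20, seed=0):
    """Restart coordinate ascent from several random positive points
    and keep the best local optimum (the 'multi-start' part)."""
    rng = random.Random(seed)
    best = None
    for _ in range(starts):
        x0 = [rng.uniform(0.1, 2.0) for _ in range(dim)]
        x = coordinate_ascent(f, x0)
        if best is None or f(x) > f(best):
            best = x
    return best

# Toy concave objective standing in for the paper's DC objective over
# 13 parameters; it exists only so the sketch runs end to end.
def toy_objective(params):
    return -sum((p - 1.0) ** 2 for p in params)

if __name__ == "__main__":
    best = multi_start(toy_objective, dim=13)
    print([round(p, 3) for p in best])  # should approach all ones
```

The multi-start wrapper matters because a DC objective is generally non-concave, so any single coordinate-ascent run can stall at a local optimum; restarting from several random positive points and keeping the best result is the standard mitigation.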