Unsupervised Deep Keyphrase Generation
Authors: Xianjie Shen, Yinghan Wang, Rui Meng, Jingbo Shang
AAAI 2022, pp. 11303-11311 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that AutoKeyGen outperforms all unsupervised baselines and can even beat a strong supervised method in certain cases. |
| Researcher Affiliation | Collaboration | Xianjie Shen (1), Yinghan Wang (2)*, Rui Meng (3), Jingbo Shang (4) — (1, 4) University of California, San Diego; (2) Amazon.com Inc; (3) Salesforce Research |
| Pseudocode | No | The paper describes the method in prose and through a diagram (Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Reproducibility. Codes and datasets for reproducing the results of this study will be released on GitHub (https://github.com/Jayshen0/Unsupervised-Deep-Keyphrase-Generation). |
| Open Datasets | Yes | We follow previous keyphrase generation studies (Meng et al. 2017; Ye and Wang 2018; Meng et al. 2019; Chen et al. 2019) and adopt five scientific publication datasets for evaluation. KP20k is the largest dataset in scientific keyphrase studies thus far. ... Table 1 presents the details of all datasets. (Dataset release is from https://github.com/memray/OpenNMT-kpg-release) |
| Dataset Splits | Yes | From Table 1, KP20k splits: Train 514,154 / Valid 19,992 / Test 19,987. |
| Hardware Specification | No | The paper does not specify any hardware details like GPU/CPU models, memory, or specific computing environments used for the experiments. |
| Software Dependencies | No | For all keyphrase extraction baselines, we utilize the open-source toolkits pke and EmbedRank. ... We apply the Porter Stemmer provided by NLTK (Bird, Klein, and Loper 2009). The paper names the pke and EmbedRank tools, the NLTK library, and the Adagrad optimizer, but does not specify version numbers for any of them. (A hedged usage sketch of these tools follows the table.) |
| Experiment Setup | Yes | The vocabulary V of the Seq2Seq models consists of the 50,000 most frequent uncased words. We train the models for 500,000 steps and take the last checkpoint for inference. The dimension of the LSTM cell is 256, the embedding dimension is 200, and the max length of the source text is 512. Models are optimized using Adagrad (Duchi, Hazan, and Singer 2011) with an initial learning rate of 0.001, linearly decayed by a ratio of 0.8 every 5 epochs. For inference, the width of beam search is set to 20. (A hedged configuration sketch follows the table.) |
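
The tooling named in the Software Dependencies row (pke and NLTK's Porter stemmer) can be exercised in a few lines of Python. The snippet below is a minimal sketch, not the authors' evaluation code: the choice of pke's TfIdf model, the sample document, and the top-10 cutoff are illustrative assumptions, and EmbedRank is omitted since it is a separate tool.

```python
# Minimal sketch using pke and NLTK (illustrative assumptions noted above).
import pke
from nltk.stem.porter import PorterStemmer

document = "We study unsupervised keyphrase generation for scientific papers."

# pke baseline: load a document, select candidate phrases, weight them,
# and return the n best. TfIdf is used here only as an example model.
extractor = pke.unsupervised.TfIdf()
extractor.load_document(input=document, language="en")
extractor.candidate_selection()
extractor.candidate_weighting()
predictions = [phrase for phrase, score in extractor.get_n_best(n=10)]

# Porter stemming from NLTK, applied to predictions (and, in evaluation,
# to the gold keyphrases) before exact matching.
stemmer = PorterStemmer()
def stem_phrase(phrase):
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

stemmed_predictions = [stem_phrase(p) for p in predictions]
print(stemmed_predictions)
```

Stemming both predictions and references before exact matching is the standard keyphrase-evaluation convention the paper's use of the Porter Stemmer refers to.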
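
The Experiment Setup row lists concrete hyperparameters but no code. The sketch below restates them in PyTorch terms purely as a reading aid: the `torch.nn.LSTM` stand-in, the use of `StepLR` to realize the 0.8-per-5-epochs decay, and the omission of the actual Seq2Seq model and training loop are assumptions, not the authors' implementation.

```python
# Hedged sketch of the reported optimization settings (assumptions noted above).
import torch

config = {
    "vocab_size": 50_000,       # 50,000 most frequent uncased words
    "lstm_hidden_dim": 256,     # LSTM cell dimension
    "embedding_dim": 200,
    "max_source_length": 512,
    "train_steps": 500_000,     # last checkpoint used for inference
    "beam_width": 20,           # beam search width at inference
}

# Stand-in module so the optimizer has parameters; the real model is a Seq2Seq network.
model = torch.nn.LSTM(config["embedding_dim"], config["lstm_hidden_dim"])

# Adagrad with initial learning rate 0.001, as reported.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)

# Decay the learning rate by a ratio of 0.8 every 5 epochs; StepLR is one way
# to realize this schedule (an assumption, since the paper does not name one).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
# In a training loop, scheduler.step() would be called once per epoch after
# that epoch's optimizer updates.
```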