Unsupervised Deep Keyphrase Generation
Authors: Xianjie Shen, Yinghan Wang, Rui Meng, Jingbo Shang
AAAI 2022, pp. 11303-11311 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that AutoKeyGen outperforms all unsupervised baselines and can even beat a strong supervised method in certain cases. |
| Researcher Affiliation | Collaboration | Xianjie Shen (1), Yinghan Wang (2)*, Rui Meng (3), Jingbo Shang (4) — (1, 4) University of California, San Diego; (2) Amazon.com Inc; (3) Salesforce Research |
| Pseudocode | No | The paper describes the method in prose and through a diagram (Figure 1) but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Reproducibility. Codes and datasets for reproducing the results of this study will be released on GitHub (https://github.com/Jayshen0/Unsupervised-Deep-Keyphrase-Generation). |
| Open Datasets | Yes | We follow previous keyphrase generation studies (Meng et al. 2017; Ye and Wang 2018; Meng et al. 2019; Chen et al. 2019) and adopt five scientific publication datasets for evaluation. KP20k is the largest dataset in scientific keyphrase studies thus far. ... Table 1 presents the details of all datasets. (Dataset release is from https://github.com/memray/OpenNMT-kpg-release) |
| Dataset Splits | Yes | From Table 1, KP20k splits: Train 514,154 / Valid 19,992 / Test 19,987. |
| Hardware Specification | No | The paper does not specify any hardware details like GPU/CPU models, memory, or specific computing environments used for the experiments. |
| Software Dependencies | No | For all keyphrase extraction baselines, we utilize the open-source toolkits pke and EmbedRank. ... We apply the Porter Stemmer provided by NLTK (Bird, Klein, and Loper 2009). The paper names the pke and EmbedRank tools, the NLTK library, and the Adagrad optimizer, but does not specify version numbers for any of them. (A hedged usage sketch of these tools follows the table.) |
| Experiment Setup | Yes | The vocabulary V of the Seq2Seq models consists of the 50,000 most frequent uncased words. We train the models for 500,000 steps and take the last checkpoint for inference. The dimension of the LSTM cell is 256, the embedding dimension is 200, and the max length of the source text is 512. Models are optimized using Adagrad (Duchi, Hazan, and Singer 2011) with an initial learning rate of 0.001, linearly decayed by a ratio of 0.8 every 5 epochs. For inference, the width of beam search is set to 20. (A hedged configuration sketch follows the table.) |
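
The tooling named in the Software Dependencies row (pke and NLTK's Porter stemmer) can be exercised in a few lines of Python. The snippet below is a minimal sketch, not the authors' evaluation code: the choice of pke's TfIdf model, the sample document, and the top-10 cutoff are illustrative assumptions, and EmbedRank is omitted since it is a separate tool.

```python
# Minimal sketch using pke and NLTK (illustrative assumptions noted above).
import pke
from nltk.stem.porter import PorterStemmer

document = "We study unsupervised keyphrase generation for scientific papers."

# pke baseline: load a document, select candidate phrases, weight them,
# and return the n best. TfIdf is used here only as an example model.
extractor = pke.unsupervised.TfIdf()
extractor.load_document(input=document, language="en")
extractor.candidate_selection()
extractor.candidate_weighting()
predictions = [phrase for phrase, score in extractor.get_n_best(n=10)]

# Porter stemming from NLTK, applied to predictions (and, in evaluation,
# to the gold keyphrases) before exact matching.
stemmer = PorterStemmer()
def stem_phrase(phrase):
    return " ".join(stemmer.stem(w) for w in phrase.lower().split())

stemmed_predictions = [stem_phrase(p) for p in predictions]
print(stemmed_predictions)
```

Stemming both predictions and references before exact matching is the standard keyphrase-evaluation convention the paper's use of the Porter Stemmer refers to.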
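
The Experiment Setup row lists concrete hyperparameters but no code. The sketch below restates them in PyTorch terms purely as a reading aid: the `torch.nn.LSTM` stand-in, the use of `StepLR` to realize the 0.8-per-5-epochs decay, and the omission of the actual Seq2Seq model and training loop are assumptions, not the authors' implementation.

```python
# Hedged sketch of the reported optimization settings (assumptions noted above).
import torch

config = {
    "vocab_size": 50_000,       # 50,000 most frequent uncased words
    "lstm_hidden_dim": 256,     # LSTM cell dimension
    "embedding_dim": 200,
    "max_source_length": 512,
    "train_steps": 500_000,     # last checkpoint used for inference
    "beam_width": 20,           # beam search width at inference
}

# Stand-in module so the optimizer has parameters; the real model is a Seq2Seq network.
model = torch.nn.LSTM(config["embedding_dim"], config["lstm_hidden_dim"])

# Adagrad with initial learning rate 0.001, as reported.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.001)

# Decay the learning rate by a ratio of 0.8 every 5 epochs; StepLR is one way
# to realize this schedule (an assumption, since the paper does not name one).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)
# In a training loop, scheduler.step() would be called once per epoch after
# that epoch's optimizer updates.
```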