Adaptive Beam Search Decoding for Discrete Keyphrase Generation

Authors: Xiaoli Huang, Tongge Xu, Lvan Jiao, Yueran Zu, Youmin Zhang (pp. 13082-13089)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on five public datasets demonstrate the proposed model can generate marginally less duplicated and more accurate keyphrases.
Researcher Affiliation | Academia | Xiaoli Huang (1), Tongge Xu (2), Lvan Jiao (1), Yueran Zu (1), Youmin Zhang (3); (1) School of Computer Science and Engineering, Beihang University; (2) School of Cyber Science and Technology, Beihang University; (3) Jiangxi Research Institute of Beihang University
Pseudocode | No | The paper describes the model architecture and methods in text and with diagrams, but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The codes of AdaGM are available at: https://github.com/huangxiaolist/adaGM
Open Datasets | Yes | Experiments are carried out on five scientific publication datasets, including KP20k (Meng et al. 2017), Inspec (Hulth 2003), Krapivin (Krapivin and Marchese 2009), NUS (Nguyen and Kan 2007), and SemEval (Kim et al. 2010).
Dataset Splits | Yes | After the two operations, the training, validation, and testing samples of the KP20k dataset are 509,818, 20,000, and 20,000, respectively.
Hardware Specification | Yes | For a fair comparison, we use the same device (i.e., GTX-1080Ti).
Software Dependencies | No | The paper mentions using the Adam optimization algorithm and provides various model hyperparameters, but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In the preprocessing stage, following (Yuan et al. 2018; Chan et al. 2019), for each document, we lowercase all characters, replace digits with a specific token <digit>, sort all the present keyphrase labels according to where they first appear in the document, and append absent keyphrases. We set the vocabulary as the most frequent 50,002 words and share it between the encoder and decoder. We set the dimension of word embedding to 100 and the hidden size of the encoder and decoder to 300. The word embedding is initialized using a uniform distribution within [-0.1, 0.1]. The initial state of the decoder is initialized as the encoder's last time-step's hidden state. Dropout with a rate of 0.1 is applied to both the encoder and decoder states. During the training stage, we use the Adam optimization algorithm (Kingma and Ba 2014) with an initial learning rate of 0.001. The learning rate will be halved if the validation loss stops dropping. Early stopping is applied when validation loss stops decreasing for three contiguous checkpoints. We also set gradient clipping of 1.0, batch size of 32, and train our model for three epochs. During the test stage, we set beam-size as 20 and threshold α as 0.015. Moreover, we calculate F1@5 and F1@M after removing all the duplicated keyphrases.
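
The preprocessing and evaluation protocol quoted above can be made concrete with a short sketch. This is a minimal illustration based only on the quoted setup, not the released AdaGM code: the helper names (`preprocess`, `f1_at_k`), the plain substring test for present keyphrases, and the regex-based digit replacement are assumptions, and the authors' implementation may differ (e.g., it may stem tokens before matching).

```python
import re

# Hyperparameters quoted in the experiment setup above.
HPARAMS = dict(vocab_size=50_002, emb_dim=100, hidden_size=300, dropout=0.1,
               lr=1e-3, grad_clip=1.0, batch_size=32, epochs=3,
               beam_size=20, alpha=0.015)

DIGIT_TOKEN = "<digit>"


def preprocess(document, keyphrases):
    """Lowercase, replace digits with <digit>, order present keyphrases by
    first appearance in the document, and append absent keyphrases."""
    doc = re.sub(r"\d+", DIGIT_TOKEN, document.lower())
    cleaned = [re.sub(r"\d+", DIGIT_TOKEN, kp.lower()) for kp in keyphrases]

    present, absent = [], []
    for kp in cleaned:
        pos = doc.find(kp)  # simplification: plain substring match
        (present if pos >= 0 else absent).append((pos, kp))
    targets = [kp for _, kp in sorted(present)] + [kp for _, kp in absent]
    return doc, targets


def f1_at_k(predictions, references, k=None):
    """F1@5 when k=5; F1@M when k=None (all predictions kept).
    Duplicated predictions are removed before scoring, as stated above."""
    deduped = list(dict.fromkeys(predictions))  # drop duplicates, keep order
    topk = deduped if k is None else deduped[:k]
    ref_set = set(references)
    correct = sum(kp in ref_set for kp in topk)
    precision = correct / len(topk) if topk else 0.0
    recall = correct / len(ref_set) if ref_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, `f1_at_k(["neural network", "neural network", "beam search"], ["beam search", "keyphrase generation"])` deduplicates to two predictions, of which one is correct, giving precision 0.5, recall 0.5, and F1@M of 0.5.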