KGR4: Retrieval, Retrospect, Refine and Rethink for Commonsense Generation
Authors: Xin Liu, Dayiheng Liu, Baosong Yang, Haibo Zhang, Junwei Ding, Wenqing Yao, Weihua Luo, Haiying Zhang, Jinsong Su (pp. 11029-11037)
AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results and in-depth analysis on the CommonGen benchmark strongly demonstrate the effectiveness of our framework. Particularly, KGR4 obtains 33.56 SPICE points on the official leaderboard, outperforming the previously-reported best result by 2.49 SPICE points and achieving state-of-the-art performance. |
| Researcher Affiliation | Collaboration | Xin Liu (1,3), Dayiheng Liu (2), Baosong Yang (2), Haibo Zhang (2), Junwei Ding (2), Wenqing Yao (2), Weihua Luo (2), Haiying Zhang (1), Jinsong Su (1,3,4*). 1: School of Informatics, Xiamen University; 2: Alibaba Group; 3: Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism; 4: Pengcheng Lab, Shenzhen |
| Pseudocode | No | The paper describes its framework and processes in prose and uses diagrams (Figure 1 and Figure 2) to illustrate them. However, it does not include any structured pseudocode blocks or algorithms labeled as such. |
| Open Source Code | Yes | We release the code at https://github.com/DeepLearn-XMU/KGR-4. |
| Open Datasets | Yes | Following previous studies, we use the CommonGen dataset constructed by Lin et al. (2020). We construct this [external] corpus by combining 3M image and video captions from several datasets: ActivityNet (Krishna et al. 2017), MultiNLI (Williams, Nangia, and Bowman 2018), SNLI (Bowman et al. 2015), VaTeX (Wang et al. 2019), MSCOCO (Lin et al. 2014) and Flickr30k (Young et al. 2014). |
| Dataset Splits | Yes | We show the basic statistics of this dataset in Table 2. Table 2 (basic statistics of the CommonGen dataset), #Concept Sets: Train 32,651; Validation 993; Test 1,497. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as CPU/GPU models, memory, or specific computing environments beyond general terms. |
| Software Dependencies | No | The paper mentions various models and optimizers used (e.g., BART-large, T5, RoBERTa-based classifier, Adam optimizer) and refers to their original papers. However, it does not list specific version numbers for software dependencies like Python, PyTorch, TensorFlow, CUDA, or other relevant libraries. |
| Experiment Setup | Yes | At the retrieval stage, we select 3 negative samples for each positive sample. We optimize the RoBERTa-based scorer using the Adam optimizer (Kingma and Ba 2015) with a learning rate of 2e-5 for 3 epochs, and set the batch size to 32. At the retrospect stage, we pretrain the generator for 80,000 steps using the pseudo instances constructed from the external corpus and then finetune the model parameters for 2,000 steps, where the learning rate of the Adam optimizer is set to 2e-5 and the batch size is 16. In both pretraining and retrospective augmentation, we sample 5 concepts from each sentence. We update the parameters of the refiner for 2,000 steps and keep the rest of the hyper-parameters the same as for the generator. In particular, we employ early stopping when training the scorer, generator, and refiner. (A hedged configuration sketch of these settings follows the table.) |
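The Experiment Setup row pins down concrete hyper-parameters for the three trained components (scorer, generator, refiner). The Python sketch below only collects those numbers in one place; it is not the authors' released code, and the class, function, and checkpoint names (`StageConfig`, `build_scorer_optimizer`, `roberta-large`) are illustrative assumptions. Only the learning rate, batch sizes, step/epoch counts, and the use of plain Adam come from the quoted setup.

```python
# Minimal configuration sketch, NOT the authors' released implementation.
# Only the numeric hyper-parameters are taken from the quoted Experiment Setup;
# all names (StageConfig, build_scorer_optimizer, "roberta-large") are assumptions.
from dataclasses import dataclass
from typing import Optional

import torch
from transformers import RobertaForSequenceClassification


@dataclass
class StageConfig:
    learning_rate: float
    batch_size: int
    epochs: Optional[int] = None   # given for the scorer
    steps: Optional[int] = None    # given for the generator/refiner


# Retrieval stage: RoBERTa-based scorer, 3 negative samples per positive,
# Adam with lr 2e-5 for 3 epochs, batch size 32.
scorer_cfg = StageConfig(learning_rate=2e-5, batch_size=32, epochs=3)

# Retrospect stage: generator pretrained for 80,000 steps on pseudo instances
# from the external corpus, then fine-tuned for 2,000 steps; lr 2e-5, batch size 16.
generator_pretrain_cfg = StageConfig(learning_rate=2e-5, batch_size=16, steps=80_000)
generator_finetune_cfg = StageConfig(learning_rate=2e-5, batch_size=16, steps=2_000)

# Refine stage: refiner updated for 2,000 steps, other hyper-parameters kept
# the same as the generator's.
refiner_cfg = StageConfig(learning_rate=2e-5, batch_size=16, steps=2_000)


def build_scorer_optimizer(model: torch.nn.Module, cfg: StageConfig) -> torch.optim.Adam:
    """Plain Adam (Kingma and Ba 2015), as stated in the quoted setup."""
    return torch.optim.Adam(model.parameters(), lr=cfg.learning_rate)


if __name__ == "__main__":
    # The paper only says "RoBERTa-based scorer"; the exact checkpoint is an
    # assumption here. Two labels: positive vs. negative retrieved sentence,
    # inferred from the 3:1 negative sampling described above.
    scorer = RobertaForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=2
    )
    optimizer = build_scorer_optimizer(scorer, scorer_cfg)
    print(scorer_cfg, generator_pretrain_cfg, generator_finetune_cfg, refiner_cfg, sep="\n")
    print(optimizer)
```

The early stopping mentioned in the quote would live in the training loops for the scorer, generator, and refiner and is omitted from this sketch.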