A Knowledge-Grounded Neural Conversation Model

Authors: Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley

AAAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach yields significant improvements over a competitive SEQ2SEQ baseline. Human judges found that our outputs are significantly more informative." "Using this framework, we have trained systems at a large scale using 23M general-domain conversations from Twitter and 1.1M Foursquare tips, showing significant improvements in terms of informativeness (human evaluation) over a competitive large-scale SEQ2SEQ model baseline."
Researcher Affiliation | Collaboration | Marjan Ghazvininejad (1), Chris Brockett (2), Ming-Wei Chang (2), Bill Dolan (2), Jianfeng Gao (2), Wen-tau Yih (2), Michel Galley (2); (1) Information Sciences Institute, USC; (2) Microsoft; ghazvini@isi.edu, mgalley@microsoft.com
Pseudocode | No | The paper describes its model architecture and components (e.g., Dialog Encoder and Decoder, Facts Encoder) but does not include any pseudocode or algorithm blocks.
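The Facts Encoder mentioned in the response is described in the paper as a memory-network-style attention over encoded facts. A minimal sketch of that general mechanism follows; the function names, shapes, and the final combination step are illustrative assumptions for this sketch, not the authors' code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def facts_encoder(u, keys, values):
    """Memory-network-style facts encoder (illustrative sketch).

    u      : (d,)   dialog-context encoding
    keys   : (n, d) input memory representations of the n facts
    values : (n, d) output memory representations of the n facts
    Returns a fact-grounded context vector for the decoder.
    """
    attn = softmax(keys @ u)      # attention weights over the n facts
    fact_summary = attn @ values  # weighted sum of fact value vectors
    return u + fact_summary       # combine context with fact summary

# Toy usage with random vectors.
rng = np.random.default_rng(0)
d, n = 8, 5
u = rng.standard_normal(d)
keys = rng.standard_normal((n, d))
values = rng.standard_normal((n, d))
grounded = facts_encoder(u, keys, values)
```

The attention weights sum to one, so the fact summary is a convex combination of the value vectors; the additive combination with `u` is one common choice, assumed here for simplicity.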
Open Source Code | No | The paper does not contain any statement or link indicating the release of source code for the described methodology.
Open Datasets | No | "We collected a 23M general dataset of 3-turn conversations. This serves as a background dataset not associated with facts, and its massive size is key to learning the conversational structure or backbone." "We extracted from the web 1.1M tips relating to establishments in North America." The paper describes how the authors collected and processed these datasets from public platforms, but it does not provide concrete access information (e.g., a URL, DOI, or formal citation with author/year) for the specific constructed datasets used in their experiments.
Dataset Splits | Yes | "Crowdsourced human judges were then presented with these 10K sampled conversations and asked to determine whether the response contained actionable information, i.e., did they contain information that would permit the respondents to decide, e.g., whether or not they should patronize an establishment. From this, we selected the top-ranked 4k conversations to be held out as validation set and test set; these were removed from our training data."
Hardware Specification | No | The paper describes the model architecture and size (e.g., '2-layer GRU models with 512 hidden cells') but does not provide any specific hardware details such as GPU models, CPU types, or memory used for training or experiments.
Software Dependencies | No | The paper mentions specific algorithms and optimizers ('GRU models', 'Adam optimizer') but does not provide version numbers for any programming languages, libraries, or frameworks (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | "More specifically, we used 2-layer GRU models with 512 hidden cells for each layer for encoder and decoder, the dimensionality of word embeddings is set to 512, and the size of input/output memory representation is 1024. We used the Adam optimizer with a fixed learning rate of 0.1. Batch size is set to 128. All parameters are initialized from a uniform distribution in [-√(3/d), √(3/d)], where d is the dimension of the parameter. Gradients are clipped at 5 to avoid gradient explosion."
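The two non-standard pieces of the quoted setup, the parameter initialization and the gradient clipping, can be sketched directly. The vocabulary size below is an illustrative assumption (it is not stated in the excerpt), and the sketch assumes the initialization interval is [-sqrt(3/d), sqrt(3/d)].

```python
import numpy as np

# Hyperparameters as reported: 2-layer GRUs with 512 hidden cells,
# 512-dim embeddings, 1024-dim memory representations, batch size 128,
# Adam with a fixed learning rate of 0.1, gradient clipping at 5.
HIDDEN, EMB, MEM, BATCH, LR, CLIP = 512, 512, 1024, 128, 0.1, 5.0

rng = np.random.default_rng(0)

def init_uniform(shape):
    # Uniform init in [-sqrt(3/d), sqrt(3/d)], where d is taken here as
    # the parameter's last dimension (an assumption of this sketch).
    d = shape[-1]
    bound = np.sqrt(3.0 / d)
    return rng.uniform(-bound, bound, size=shape)

def clip_gradient(grad, threshold=CLIP):
    # Rescale the gradient when its norm exceeds the clipping threshold.
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

# Example: initialize an embedding matrix (vocab size is assumed).
W_emb = init_uniform((20000, EMB))

# Example: clip a deliberately oversized gradient.
g = clip_gradient(rng.standard_normal((HIDDEN, HIDDEN)) * 10)
```

This initialization gives each parameter a variance of 1/d, a common scaling choice for recurrent models of this era.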