Learning through Dialogue Interactions by Asking Questions

Authors: Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate how a learner can benefit from asking questions in both offline and online reinforcement learning settings, and demonstrate that the learner improves when asking questions. Finally, real experiments with Mechanical Turk validate the approach.
Researcher Affiliation | Industry | Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston Facebook AI Research, New York, USA {jiwel,ahm,spchopra,ranzato,jase}@fb.com
Pseudocode | No | We use the REINFORCE algorithm (Williams, 1992) to update P_RL(Question) and P_RL(Answer). The paper describes the algorithm used but does not provide pseudocode or a clearly labeled algorithm block (a policy-gradient sketch follows this table).
Open Source Code | Yes | Code and data are available at https://github.com/facebook/MemNN/tree/master/AskingQuestions.
Open Datasets | Yes | For our experiments we adapt the WikiMovies dataset (Weston et al., 2015), which consists of roughly 100k questions over 75k entities based on questions with answers in the open movie dataset (OMDb). The training/dev/test sets respectively contain 181638 / 9702 / 9698 examples.
Dataset Splits | Yes | The training/dev/test sets respectively contain 181638 / 9702 / 9698 examples.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments.
Software Dependencies | No | For both offline supervised and online RL settings, we use the End-to-End Memory Network model (MemN2N) (Sukhbaatar et al., 2015) as a backbone. We use the REINFORCE algorithm (Williams, 1992) to update P_RL(Question) and P_RL(Answer). The paper names the models and algorithms used but does not provide specific software dependencies with version numbers (a MemN2N sketch is included after the table).
Experiment Setup | No | In practice, we find the following training strategy yields better results: first train only P_RL(answer), updating gradients only for the policy that predicts the final answer. After the bot’s final-answer policy is sufficiently learned, train both policies in parallel. We implement this by running 16 epochs in total, updating only the model’s policy for final answers in the first 8 epochs while updating both policies during the second 8 epochs. We pick the model that achieves the best reward on the dev set during the final 8 epochs. Due to relatively large variance for RL models, we repeat each task 5 times and keep the best model on each task. The paper describes aspects of the training process and parts of the hyperparameter tuning strategy, but it does not provide a comprehensive list of specific hyperparameter values (e.g., learning rate, batch size) or a clearly labeled section for the experimental setup (the two-phase schedule is sketched after the table).
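As the Pseudocode row notes, the paper states that REINFORCE (Williams, 1992) updates P_RL(Question) and P_RL(Answer) but provides no algorithm block. The following is a minimal sketch of what such an update could look like, assuming a simple ask-then-answer episode, a +1/-1 reward, and placeholder question_policy, answer_policy, optimizer, and episode objects; it is an illustration, not the authors' implementation.

```python
# Hedged sketch: REINFORCE-style update for two policies, loosely following the
# paper's description (P_RL(Question) decides whether to ask, P_RL(Answer) picks
# the final answer). Episode structure and reward scheme are assumptions.
import torch

def reinforce_step(question_policy, answer_policy, optimizer, episode):
    """One episode: optionally ask a question, then answer.
    `optimizer` is assumed to hold the parameters of both policies."""
    log_probs = []

    # Binary decision: ask the teacher a question or not.
    q_logits = question_policy(episode["context"])            # shape: (2,)
    q_dist = torch.distributions.Categorical(logits=q_logits)
    q_action = q_dist.sample()
    log_probs.append(q_dist.log_prob(q_action))

    # If the bot asked, the (simulated) teacher's reply is appended to the context.
    context = episode["context_with_reply"] if q_action.item() == 1 else episode["context"]

    # Pick the final answer from the candidate set.
    a_logits = answer_policy(context)                          # shape: (num_candidates,)
    a_dist = torch.distributions.Categorical(logits=a_logits)
    a_action = a_dist.sample()
    log_probs.append(a_dist.log_prob(a_action))

    # Assumed reward: +1 for a correct final answer, -1 otherwise.
    reward = 1.0 if a_action.item() == episode["answer_index"] else -1.0

    # REINFORCE: ascend reward * log pi(action), i.e. minimize its negation.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```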
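The Software Dependencies row quotes the End-to-End Memory Network (MemN2N) backbone without implementation detail. Below is a compact sketch of a multi-hop MemN2N candidate scorer, assuming bag-of-words embeddings, a single memory embedding shared across hops, and dot-product candidate scoring; dimensions and encoding choices are illustrative assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of an end-to-end memory network (Sukhbaatar et al., 2015)
# used as a candidate scorer. Hop count, embedding size, and the bag-of-words
# encoders are illustrative assumptions.
import torch
import torch.nn as nn

class MemN2N(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hops=3):
        super().__init__()
        self.embed_query = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_memory = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_output = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_candidates = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.hops = hops

    def forward(self, query, memories, candidates):
        # query:      (query_len,)           word ids of the question
        # memories:   (num_mem, mem_len)     word ids of dialogue/KB memories
        # candidates: (num_cand, cand_len)   word ids of answer candidates
        u = self.embed_query(query.unsqueeze(0))          # (1, d) controller state
        m = self.embed_memory(memories)                   # (num_mem, d) input memory
        c = self.embed_output(memories)                   # (num_mem, d) output memory
        for _ in range(self.hops):
            attn = torch.softmax(u @ m.t(), dim=-1)       # attention over memories
            u = u + attn @ c                              # update controller state
        cand = self.embed_candidates(candidates)          # (num_cand, d)
        return (u @ cand.t()).squeeze(0)                  # (num_cand,) candidate scores
```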
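The Experiment Setup row quotes a two-phase schedule: 8 epochs updating only the final-answer policy, 8 more updating both policies, model selection by dev-set reward in the second phase, and 5 repeats per task to absorb RL variance. The driver loop below sketches that schedule; make_model, train_epoch, and evaluate_dev_reward are hypothetical placeholders, not functions from the released code.

```python
# Hedged sketch of the two-phase schedule described in the paper's setup:
# epochs 1-8 update only the final-answer policy, epochs 9-16 update both
# policies; the best dev-reward model from the second phase is kept, and the
# whole run is repeated 5 times per task. All callables are placeholders.
import copy

TOTAL_EPOCHS = 16
ANSWER_ONLY_EPOCHS = 8
NUM_REPEATS = 5

def run_task(make_model, train_epoch, evaluate_dev_reward):
    best_overall = (float("-inf"), None)
    for _ in range(NUM_REPEATS):               # RL variance: keep the best of 5 runs
        model = make_model()
        best_run = (float("-inf"), None)
        for epoch in range(1, TOTAL_EPOCHS + 1):
            update_question_policy = epoch > ANSWER_ONLY_EPOCHS
            train_epoch(model, update_question_policy=update_question_policy)
            if epoch > ANSWER_ONLY_EPOCHS:      # model selection only in the second phase
                dev_reward = evaluate_dev_reward(model)
                if dev_reward > best_run[0]:
                    best_run = (dev_reward, copy.deepcopy(model))
        if best_run[0] > best_overall[0]:
            best_overall = best_run
    return best_overall[1]                      # best model across the 5 repeats
```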