Learning through Dialogue Interactions by Asking Questions

Authors: Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate how a learner can benefit from asking questions in both offline and online reinforcement learning settings, and demonstrate that the learner improves when asking questions. Finally, real experiments with Mechanical Turk validate the approach.
Researcher Affiliation | Industry | Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston Facebook AI Research, New York, USA {jiwel,ahm,spchopra,ranzato,jase}@fb.com
Pseudocode | No | We use the REINFORCE algorithm (Williams, 1992) to update P_RL(Question) and P_RL(Answer). The paper describes the algorithm used but does not provide pseudocode or a clearly labeled algorithm block (a policy-gradient sketch follows this table).
Open Source Code | Yes | Code and data are available at https://github.com/facebook/MemNN/tree/master/AskingQuestions.
Open Datasets | Yes | For our experiments we adapt the WikiMovies dataset (Weston et al., 2015), which consists of roughly 100k questions over 75k entities based on questions with answers in the open movie dataset (OMDb). The training/dev/test sets respectively contain 181638 / 9702 / 9698 examples.
Dataset Splits | Yes | The training/dev/test sets respectively contain 181638 / 9702 / 9698 examples.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory amounts used for running experiments.
Software Dependencies | No | For both offline supervised and online RL settings, we use the End-to-End Memory Network model (MemN2N) (Sukhbaatar et al., 2015) as a backbone. We use the REINFORCE algorithm (Williams, 1992) to update P_RL(Question) and P_RL(Answer). The paper names the models and algorithms used but does not provide specific software dependencies with version numbers (a MemN2N sketch is included after the table).
Experiment Setup | No | In practice, we find the following training strategy yields better results: first train only P_RL(answer), updating gradients only for the policy that predicts the final answer. After the bot’s final-answer policy is sufficiently learned, train both policies in parallel. We implement this by running 16 epochs in total, updating only the model’s policy for final answers in the first 8 epochs while updating both policies during the second 8 epochs. We pick the model that achieves the best reward on the dev set during the final 8 epochs. Due to relatively large variance for RL models, we repeat each task 5 times and keep the best model on each task. The paper describes aspects of the training process and parts of the hyperparameter tuning strategy, but it does not provide a comprehensive list of specific hyperparameter values (e.g., learning rate, batch size) or a clearly labeled section for the experimental setup (the two-phase schedule is sketched after the table).
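As the Pseudocode row notes, the paper states that REINFORCE (Williams, 1992) updates P_RL(Question) and P_RL(Answer) but provides no algorithm block. The following is a minimal sketch of what such an update could look like, assuming a simple ask-then-answer episode, a +1/-1 reward, and placeholder question_policy, answer_policy, optimizer, and episode objects; it is an illustration, not the authors' implementation.

```python
# Hedged sketch: REINFORCE-style update for two policies, loosely following the
# paper's description (P_RL(Question) decides whether to ask, P_RL(Answer) picks
# the final answer). Episode structure and reward scheme are assumptions.
import torch

def reinforce_step(question_policy, answer_policy, optimizer, episode):
    """One episode: optionally ask a question, then answer.
    `optimizer` is assumed to hold the parameters of both policies."""
    log_probs = []

    # Binary decision: ask the teacher a question or not.
    q_logits = question_policy(episode["context"])            # shape: (2,)
    q_dist = torch.distributions.Categorical(logits=q_logits)
    q_action = q_dist.sample()
    log_probs.append(q_dist.log_prob(q_action))

    # If the bot asked, the (simulated) teacher's reply is appended to the context.
    context = episode["context_with_reply"] if q_action.item() == 1 else episode["context"]

    # Pick the final answer from the candidate set.
    a_logits = answer_policy(context)                          # shape: (num_candidates,)
    a_dist = torch.distributions.Categorical(logits=a_logits)
    a_action = a_dist.sample()
    log_probs.append(a_dist.log_prob(a_action))

    # Assumed reward: +1 for a correct final answer, -1 otherwise.
    reward = 1.0 if a_action.item() == episode["answer_index"] else -1.0

    # REINFORCE: ascend reward * log pi(action), i.e. minimize its negation.
    loss = -reward * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```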
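The Software Dependencies row quotes the End-to-End Memory Network (MemN2N) backbone without implementation detail. Below is a compact sketch of a multi-hop MemN2N candidate scorer, assuming bag-of-words embeddings, a single memory embedding shared across hops, and dot-product candidate scoring; dimensions and encoding choices are illustrative assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of an end-to-end memory network (Sukhbaatar et al., 2015)
# used as a candidate scorer. Hop count, embedding size, and the bag-of-words
# encoders are illustrative assumptions.
import torch
import torch.nn as nn

class MemN2N(nn.Module):
    def __init__(self, vocab_size, embed_dim=50, hops=3):
        super().__init__()
        self.embed_query = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_memory = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_output = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.embed_candidates = nn.EmbeddingBag(vocab_size, embed_dim, mode="sum")
        self.hops = hops

    def forward(self, query, memories, candidates):
        # query:      (query_len,)           word ids of the question
        # memories:   (num_mem, mem_len)     word ids of dialogue/KB memories
        # candidates: (num_cand, cand_len)   word ids of answer candidates
        u = self.embed_query(query.unsqueeze(0))          # (1, d) controller state
        m = self.embed_memory(memories)                   # (num_mem, d) input memory
        c = self.embed_output(memories)                   # (num_mem, d) output memory
        for _ in range(self.hops):
            attn = torch.softmax(u @ m.t(), dim=-1)       # attention over memories
            u = u + attn @ c                              # update controller state
        cand = self.embed_candidates(candidates)          # (num_cand, d)
        return (u @ cand.t()).squeeze(0)                  # (num_cand,) candidate scores
```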
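The Experiment Setup row quotes a two-phase schedule: 8 epochs updating only the final-answer policy, 8 more updating both policies, model selection by dev-set reward in the second phase, and 5 repeats per task to absorb RL variance. The driver loop below sketches that schedule; make_model, train_epoch, and evaluate_dev_reward are hypothetical placeholders, not functions from the released code.

```python
# Hedged sketch of the two-phase schedule described in the paper's setup:
# epochs 1-8 update only the final-answer policy, epochs 9-16 update both
# policies; the best dev-reward model from the second phase is kept, and the
# whole run is repeated 5 times per task. All callables are placeholders.
import copy

TOTAL_EPOCHS = 16
ANSWER_ONLY_EPOCHS = 8
NUM_REPEATS = 5

def run_task(make_model, train_epoch, evaluate_dev_reward):
    best_overall = (float("-inf"), None)
    for _ in range(NUM_REPEATS):               # RL variance: keep the best of 5 runs
        model = make_model()
        best_run = (float("-inf"), None)
        for epoch in range(1, TOTAL_EPOCHS + 1):
            update_question_policy = epoch > ANSWER_ONLY_EPOCHS
            train_epoch(model, update_question_policy=update_question_policy)
            if epoch > ANSWER_ONLY_EPOCHS:      # model selection only in the second phase
                dev_reward = evaluate_dev_reward(model)
                if dev_reward > best_run[0]:
                    best_run = (dev_reward, copy.deepcopy(model))
        if best_run[0] > best_overall[0]:
            best_overall = best_run
    return best_overall[1]                      # best model across the 5 repeats
```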