Dialogue Learning With Human-in-the-Loop

Authors: Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we explore this direction in a reinforcement learning setting where the bot improves its question-answering ability from feedback a teacher gives following its generated responses. We build a simulator that tests various aspects of such learning in a synthetic environment, and introduce models that work in this regime. Finally, real experiments with Mechanical Turk validate the approach.
Researcher Affiliation | Industry | Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston; Facebook AI Research, New York, USA; {jiwel,ahm,spchopra,ranzato,jase}@fb.com
Pseudocode | No | The paper describes its algorithms (RBI, REINFORCE, and FP) but does not present them in a pseudocode block or a clearly labeled algorithm figure.
Open Source Code | Yes | Code and data are available at https://github.com/facebook/MemNN/tree/master/HITL.
Open Datasets | Yes | Following Weston (2016), we use (i) the single supporting fact problem from the bAbI datasets (Weston et al., 2015)...; and (ii) the WikiMovies dataset (Weston et al., 2015)...
Dataset Splits | Yes | We use the same train/valid/test splits. [...] hyperparameters are tuned on a similarly sized validation set.
Hardware Specification | No | No specific hardware details (such as GPU models, CPU types, or memory) are mentioned in the paper.
Software Dependencies | No | The paper mentions the 'MemN2N' model, but it does not specify versions of programming languages, libraries, or frameworks (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x).
Experiment Setup | Yes | In order to make this work in the online setting which requires exploration to find the correct answer, we employ an ϵ-greedy strategy: the learner makes a prediction using its own model (the answer assigned the highest probability) with probability 1 − ϵ, otherwise it picks a random answer with probability ϵ. [...] We use batch size to refer to how many dialogue episodes the current model is used to collect feedback before updating its parameters.
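
To make the ϵ-greedy exploration and batch-update scheme quoted in the Experiment Setup row concrete, here is a minimal Python sketch. It is not the authors' implementation: model.score, model.update, and teacher.feedback are hypothetical interfaces standing in for the MemN2N learner and the simulated or Mechanical Turk teacher, and the default values for epsilon and batch_size are illustrative only.

    import random

    def epsilon_greedy_answer(model, context, question, candidates, epsilon=0.25):
        """Pick one answer per dialogue turn with epsilon-greedy exploration."""
        if random.random() < epsilon:
            # Explore: answer with a random candidate (probability epsilon).
            return random.choice(candidates)
        # Exploit: answer with the candidate the current model scores highest
        # (probability 1 - epsilon).
        scores = [model.score(context, question, a) for a in candidates]
        return candidates[scores.index(max(scores))]

    def online_loop(model, teacher, episodes, batch_size=32, epsilon=0.25):
        """Collect feedback for batch_size dialogue episodes, then update the learner."""
        buffer = []
        for context, question, candidates in episodes:
            answer = epsilon_greedy_answer(model, context, question, candidates, epsilon)
            reward = teacher.feedback(context, question, answer)  # teacher's response to the bot's answer
            buffer.append((context, question, answer, reward))
            if len(buffer) == batch_size:
                model.update(buffer)  # e.g. a reward-based imitation or REINFORCE update
                buffer = []

Here "batch size" matches the paper's usage: the number of dialogue episodes collected with the current model before its parameters are updated.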