Dialogue Learning With Human-in-the-Loop
Authors: Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we explore this direction in a reinforcement learning setting where the bot improves its question-answering ability from feedback a teacher gives following its generated responses. We build a simulator that tests various aspects of such learning in a synthetic environment, and introduce models that work in this regime. Finally, real experiments with Mechanical Turk validate the approach. |
| Researcher Affiliation | Industry | Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc'Aurelio Ranzato, Jason Weston Facebook AI Research, New York, USA {jiwel,ahm,spchopra,ranzato,jase}@fb.com |
| Pseudocode | No | The paper describes the RBI, REINFORCE, and FP algorithms in prose but does not present them in a pseudocode block or a clearly labeled algorithm figure. (A hedged sketch of RBI and REINFORCE follows the table.) |
| Open Source Code | Yes | Code and data are available at https://github.com/facebook/MemNN/tree/master/HITL. |
| Open Datasets | Yes | Following Weston (2016), we use (i) the single supporting fact problem from the bAbI datasets (Weston et al., 2015)...; and (ii) the WikiMovies dataset (Weston et al., 2015)... |
| Dataset Splits | Yes | We use the same train/valid/test splits. [...] hyperparameters are tuned on a similarly sized validation set. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, memory) are mentioned in the paper. |
| Software Dependencies | No | The paper mentions the MemN2N model, but it does not specify versions of programming languages, libraries, or frameworks (e.g., Python 3.x, TensorFlow x.x, PyTorch x.x). |
| Experiment Setup | Yes | In order to make this work in the online setting which requires exploration to find the correct answer, we employ an ϵ-greedy strategy: the learner makes a prediction using its own model (the answer assigned the highest probability) with probability 1 − ϵ, otherwise it picks a random answer with probability ϵ. [...] We use batch size to refer to how many dialogue episodes the current model is used to collect feedback before updating its parameters. (See the sketch immediately after this table.) |
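
To make the quoted setup concrete, here is a minimal Python sketch of the ϵ-greedy exploration and batched-update loop described in the Experiment Setup row. The `model.predict`, `model.update`, and `teacher.feedback` calls are hypothetical placeholders standing in for the paper's MemN2N interface, which the released code defines differently.

```python
import random

def epsilon_greedy_answer(model, dialogue, candidates, epsilon):
    """With probability 1 - epsilon exploit the model's top-scoring answer;
    with probability epsilon explore a uniformly random candidate."""
    if random.random() < epsilon:
        return random.choice(candidates)             # explore
    scores = model.predict(dialogue, candidates)     # hypothetical scoring call
    return max(candidates, key=lambda a: scores[a])  # exploit

def online_loop(model, teacher, episodes, batch_size=32, epsilon=0.25):
    """Collect `batch_size` dialogue episodes with the current model before
    each parameter update, matching the paper's notion of batch size."""
    buffer = []
    for dialogue, candidates in episodes:
        answer = epsilon_greedy_answer(model, dialogue, candidates, epsilon)
        reward = teacher.feedback(dialogue, answer)  # e.g. +1 if correct
        buffer.append((dialogue, answer, reward))
        if len(buffer) == batch_size:
            model.update(buffer)                     # hypothetical update step
            buffer.clear()
```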
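Since the paper names RBI (reward-based imitation), REINFORCE, and FP (forward prediction) without pseudocode, the sketch below illustrates the first two on a toy linear softmax policy over a fixed answer set. This is an illustration under stated assumptions, not the paper's MemN2N implementation: the linear scorer, `reward_fn`, and the hyperparameter values are placeholders, and FP is omitted because it depends on the textual-feedback channel. The `baseline` argument stands in for the variance-reduction baseline that REINFORCE implementations typically subtract.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_FEATURES, NUM_ANSWERS = 8, 4
W = np.zeros((NUM_FEATURES, NUM_ANSWERS))  # toy linear policy in place of MemN2N

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(x, reward_fn, lr=0.1, baseline=0.0):
    """REINFORCE: sample an answer from the policy, observe the teacher's
    reward, and ascend (r - b) * grad log p(a | x)."""
    global W
    p = softmax(x @ W)                   # distribution over candidate answers
    a = rng.choice(NUM_ANSWERS, p=p)     # sampled answer
    r = reward_fn(a)                     # teacher feedback, e.g. 0/1
    grad_log_p = np.outer(x, np.eye(NUM_ANSWERS)[a] - p)
    W += lr * (r - baseline) * grad_log_p

def rbi_step(x, reward_fn, lr=0.1, epsilon=0.25):
    """Reward-based imitation: act epsilon-greedily, then take a supervised
    (cross-entropy) step only on answers the teacher rewarded."""
    global W
    p = softmax(x @ W)
    a = int(rng.integers(NUM_ANSWERS)) if rng.random() < epsilon else int(p.argmax())
    if reward_fn(a) > 0:                 # imitate only positively rewarded turns
        grad_log_p = np.outer(x, np.eye(NUM_ANSWERS)[a] - p)
        W += lr * grad_log_p             # standard supervised softmax gradient
```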