Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Bottleneck Simulator: A Model-Based Deep Reinforcement Learning Approach
Authors: Iulian Vlad Serban, Chinnadhurai Sankar, Michael Pieper, Joelle Pineau, Yoshua Bengio
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a real-world, complex dialogue response selection task. On both tasks, the Bottleneck Simulator yields excellent performance beating competing approaches. |
| Researcher Affiliation | Academia | Iulian Vlad Serban EMAIL Chinnadhurai Sankar EMAIL Mila (Quebec Artificial Intelligence Institute) Department of Computer Science and Operations Research University of Montreal, Montreal, Canada Michael Pieper EMAIL Polytechnique Montreal Montreal, Canada Joelle Pineau EMAIL Mila (Quebec Artificial Intelligence Institute) School of Computer Science, McGill University Montreal, Canada Yoshua Bengio EMAIL Mila (Quebec Artificial Intelligence Institute) Department of Computer Science and Operations Research University of Montreal, Montreal, Canada |
| Pseudocode | No | The paper describes the model and learning processes using mathematical equations and textual explanations, for example in Section 3 'Bottleneck Simulator' and Section 3.2 'Learning', but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository. It only mentions two demo videos for the dialogue system: 'https://youtu.be/TCVbYpu9Llo and https://youtu.be/LG482LzW77Y.' |
| Open Datasets | Yes | The first task is the text adventure game Home World introduced by Narasimhan, Kulkarni, and Barzilay (2015). The second task is the 2017 Amazon Alexa Prize Competition (Ram, Prasad, Khatri, Venkatesh, et al., 2017), where a spoken dialogue system must converse coherently and engagingly with humans on popular topics. |
| Dataset Splits | Yes | The training dataset consists of 500,000 recorded dialogue transitions, of which 70% of the dialogues are used as training set and 30% of the dialogues are used as validation set. In total, we collected 199,678 labels. These are split into training (train), development (dev) and testing (test) sets consisting of respectively 137,549, 23,298 and 38,831 labels each. |
| Hardware Specification | Yes | The authors wish to thank Amazon for providing Tesla K80 GPUs through the Amazon Web Services platform. Some of the Titan X GPUs used for this research were generously donated by the NVIDIA Corporation. |
| Software Dependencies | No | The paper mentions several techniques and tools such as 'k-means clustering', 'Glove word embeddings', 'MLP', and the 'Adam' optimizer, but does not specify any version numbers for these software components or programming languages used (e.g., Python version, library versions like TensorFlow/PyTorch). |
| Experiment Setup | Yes | We use the same experimental setup and hyper-parameters as Narasimhan et al. (2015) for our baseline. We train a state-action-value function baseline policy parametrized as a feed-forward neural network with Q-learning. We use Glove word embeddings (Pennington, Socher, & Manning, 2014). We train the transition model on the collected episodes. The transition model is a two-layer MLP classifier. The policy is parametrized as a state-action-value function Q(s, a), taking as input the dialogue history s and a candidate response a. Based on the dialogue history s and candidate response a, 1458 features are computed. The Bottleneck Simulator environment model uses a transition distribution parametrized by three independent two-layer MLP models. We use the first-order gradient-descent optimizer Adam (Kingma & Ba, 2015). We experiment with a variety of hyper-parameters, and select the best hyper-parameter combination based on the log-likelihood of the dev set. For the first hidden layer, we experiment with layer sizes in the set: {500, 200, 50}. For the second hidden layer, we experiment with layer sizes in the set: {50, 20, 5}. We use L2 regularization on all model parameters, except for bias parameters. We experiment with L2 regularization coefficients in the set: {10.0, 1.0, 10^-1, ..., 10^-9}. Unfortunately, we do not have labels to train the last layer. Therefore, we fix the parameters of the last layer to the vector [1.0, 2.0, 3.0, 4.0, 5.0]. |
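The hyper-parameter sweep quoted in the Experiment Setup row can be sketched as a plain grid search. This is a minimal illustration, not the authors' code: `dev_log_likelihood` is a hypothetical stand-in for training each two-layer MLP with Adam and scoring it by dev-set log-likelihood, which the paper does not release.

```python
import itertools

# Grids quoted in the paper's experiment setup: first and second hidden
# layer sizes, and L2 coefficients 10.0, 1.0, 10^-1, ..., 10^-9.
FIRST_LAYER_SIZES = [500, 200, 50]
SECOND_LAYER_SIZES = [50, 20, 5]
L2_COEFFICIENTS = [10.0, 1.0] + [10.0 ** -k for k in range(1, 10)]


def dev_log_likelihood(h1, h2, l2):
    """Hypothetical stand-in: in the paper this would be the dev-set
    log-likelihood of a two-layer MLP trained with the given sizes
    and L2 coefficient."""
    return -(abs(h1 - 200) + abs(h2 - 20)) - 100 * l2


def select_best():
    # Exhaustively score every configuration and keep the best one,
    # mirroring "select the best hyper-parameter combination based on
    # the log-likelihood of the dev set".
    grid = itertools.product(FIRST_LAYER_SIZES, SECOND_LAYER_SIZES, L2_COEFFICIENTS)
    return max(grid, key=lambda cfg: dev_log_likelihood(*cfg))


best = select_best()
print(best)
```

With the grids above the search covers 3 x 3 x 11 = 99 configurations; in practice each call to the scorer is a full training run, so the sweep is embarrassingly parallel across configurations.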