Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Bottleneck Simulator: A Model-Based Deep Reinforcement Learning Approach
Authors: Iulian Vlad Serban, Chinnadhurai Sankar, Michael Pieper, Joelle Pineau, Yoshua Bengio
JAIR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we evaluate the Bottleneck Simulator on two natural language processing tasks: a text adventure game and a real-world, complex dialogue response selection task. On both tasks, the Bottleneck Simulator yields excellent performance beating competing approaches. |
| Researcher Affiliation | Academia | Iulian Vlad Serban EMAIL Chinnadhurai Sankar EMAIL Mila (Quebec Artificial Intelligence Institute) Department of Computer Science and Operations Research University of Montreal, Montreal, Canada Michael Pieper EMAIL Polytechnique Montreal Montreal, Canada Joelle Pineau EMAIL Mila (Quebec Artificial Intelligence Institute) School of Computer Science, McGill University Montreal, Canada Yoshua Bengio EMAIL Mila (Quebec Artificial Intelligence Institute) Department of Computer Science and Operations Research University of Montreal, Montreal, Canada |
| Pseudocode | No | The paper describes the model and learning processes using mathematical equations and textual explanations, for example in Section 3 'Bottleneck Simulator' and Section 3.2 'Learning', but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology, nor does it include links to a code repository. It only mentions two demo videos for the dialogue system: 'https://youtu.be/TCVbYpu9Llo and https://youtu.be/LG482LzW77Y.' |
| Open Datasets | Yes | The first task is the text adventure game Home World introduced by Narasimhan, Kulkarni, and Barzilay (2015). The second task is the 2017 Amazon Alexa Prize Competition (Ram, Prasad, Khatri, Venkatesh, et al., 2017), where a spoken dialogue system must converse coherently and engagingly with humans on popular topics. |
| Dataset Splits | Yes | The training dataset consists of 500,000 recorded dialogue transitions, of which 70% of the dialogues are used as training set and 30% of the dialogues are used as validation set. In total, we collected 199,678 labels. These are split into training (train), development (dev) and testing (test) sets consisting of respectively 137,549, 23,298 and 38,831 labels each. |
| Hardware Specification | Yes | The authors wish to thank Amazon for providing Tesla K80 GPUs through the Amazon Web Services platform. Some of the Titan X GPUs used for this research were generously donated by the NVIDIA Corporation. |
| Software Dependencies | No | The paper mentions several techniques and tools such as 'k-means clustering', 'Glove word embeddings', 'MLP', and the 'Adam' optimizer, but does not specify any version numbers for these software components or programming languages used (e.g., Python version, library versions like TensorFlow/PyTorch). |
| Experiment Setup | Yes | We use the same experimental setup and hyper-parameters as Narasimhan et al. (2015) for our baseline. We train a state-action-value function baseline policy parametrized as a feed-forward neural network with Q-learning. We use Glove word embeddings (Pennington, Socher, & Manning, 2014). We train the transition model on the collected episodes. The transition model is a two-layer MLP classifier. The policy is parametrized as a state-action-value function Q(s, a), taking as input the dialogue history s and a candidate response a. Based on the dialogue history s and candidate response a, 1458 features are computed. The Bottleneck Simulator environment model uses a transition distribution parametrized by three independent two-layer MLP models. We use the first-order gradient-descent optimizer Adam (Kingma & Ba, 2015). We experiment with a variety of hyper-parameters, and select the best hyper-parameter combination based on the log-likelihood of the dev set. For the first hidden layer, we experiment with layer sizes in the set: {500, 200, 50}. For the second hidden layer, we experiment with layer sizes in the set: {50, 20, 5}. We use L2 regularization on all model parameters, except for bias parameters. We experiment with L2 regularization coefficients in the set: {10.0, 1.0, 10^-1, ..., 10^-9}. Unfortunately, we do not have labels to train the last layer. Therefore, we fix the parameters of the last layer to the vector [1.0, 2.0, 3.0, 4.0, 5.0]. |
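The hyper-parameter sweep quoted in the Experiment Setup row can be sketched as a plain grid search. This is a minimal illustration, not the authors' code: `dev_log_likelihood` is a hypothetical stand-in for training each two-layer MLP with Adam and scoring it by dev-set log-likelihood, which the paper does not release.

```python
import itertools

# Grids quoted in the paper's experiment setup: first and second hidden
# layer sizes, and L2 coefficients 10.0, 1.0, 10^-1, ..., 10^-9.
FIRST_LAYER_SIZES = [500, 200, 50]
SECOND_LAYER_SIZES = [50, 20, 5]
L2_COEFFICIENTS = [10.0, 1.0] + [10.0 ** -k for k in range(1, 10)]


def dev_log_likelihood(h1, h2, l2):
    """Hypothetical stand-in: in the paper this would be the dev-set
    log-likelihood of a two-layer MLP trained with the given sizes
    and L2 coefficient."""
    return -(abs(h1 - 200) + abs(h2 - 20)) - 100 * l2


def select_best():
    # Exhaustively score every configuration and keep the best one,
    # mirroring "select the best hyper-parameter combination based on
    # the log-likelihood of the dev set".
    grid = itertools.product(FIRST_LAYER_SIZES, SECOND_LAYER_SIZES, L2_COEFFICIENTS)
    return max(grid, key=lambda cfg: dev_log_likelihood(*cfg))


best = select_best()
print(best)
```

With the grids above the search covers 3 x 3 x 11 = 99 configurations; in practice each call to the scorer is a full training run, so the sweep is embarrassingly parallel across configurations.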