Opponent Modeling in Deep Reinforcement Learning

Authors: He He, Jordan Boyd-Graber, Kevin Kwok, Hal Daumé III

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants.
Researcher Affiliation | Academia | He He (HHE@UMIACS.UMD.EDU), University of Maryland, College Park, MD 20740, USA; Jordan Boyd-Graber (JORDAN.BOYD.GRABER@COLORADO.EDU), University of Colorado, Boulder, CO 80309, USA; Kevin Kwok (KKWOK@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Hal Daumé III (HAL@UMIACS.UMD.EDU), University of Maryland, College Park, MD 20740, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data: https://github.com/hhexiy/opponent
Open Datasets | Yes | We collect question/answer pairs and log user buzzes from Protobowl, an online multi-player quizbowl application. Additionally, we include data from Boyd-Graber et al. (2012).
Dataset Splits | Yes | We divide all questions into two non-overlapping sets: one for training the content model and one for training the buzzing policy. The two sets are further divided into train/dev and train/dev/test sets randomly. (A split sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing instance specifications used for the experiments.
Software Dependencies | No | The paper mentions optimization algorithms and neural network components (e.g., AdaGrad, GRU) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | All systems are trained under the same Q-learning framework. Unless stated otherwise, the experiments have the following configuration: the discount factor γ is 0.9, parameters are optimized by AdaGrad (Duchi et al., 2011) with a learning rate of 0.0005, and the mini-batch size is 64. We use ϵ-greedy exploration during training, starting with an exploration rate of 0.3 that linearly decays to 0.1 within 500,000 steps. We train all models for fifty epochs. Cross-entropy is used as the loss in multitask learning. (A configuration sketch follows the table.)
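
The Dataset Splits row describes a two-stage split: questions are first divided into two non-overlapping pools (one for the content model, one for the buzzing policy), and each pool is then split randomly. The short Python sketch below illustrates that procedure; it is not the authors' code, and the split ratios, function name, and dictionary keys are assumptions.

import random

# Hypothetical illustration of the two-stage split described in the
# Dataset Splits row; only the two-pool structure and the train/dev vs.
# train/dev/test breakdown come from the paper, the ratios are assumed.
def split_questions(questions, seed=0):
    rng = random.Random(seed)
    questions = list(questions)
    rng.shuffle(questions)

    half = len(questions) // 2
    content_pool = questions[:half]   # pool for training the content model
    policy_pool = questions[half:]    # pool for training the buzzing policy

    # Content pool: train/dev (assumed 90/10 split).
    n_dev = len(content_pool) // 10
    content = {"dev": content_pool[:n_dev], "train": content_pool[n_dev:]}

    # Policy pool: train/dev/test (assumed 80/10/10 split).
    n = len(policy_pool)
    policy = {
        "test": policy_pool[: n // 10],
        "dev": policy_pool[n // 10 : n // 5],
        "train": policy_pool[n // 5 :],
    }
    return content, policy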
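
The Experiment Setup row gives concrete hyperparameters. The sketch below collects them into a configuration object and shows the linear ϵ-decay schedule; the class and helper names are assumptions, and only the numeric values are taken from the quoted setup.

from dataclasses import dataclass

# Hypothetical container for the quoted Q-learning settings; the numbers
# come from the Experiment Setup row, everything else is an assumption.
@dataclass
class QLearningConfig:
    gamma: float = 0.9             # discount factor
    learning_rate: float = 0.0005  # AdaGrad step size
    batch_size: int = 64           # mini-batch size
    eps_start: float = 0.3         # initial exploration rate
    eps_end: float = 0.1           # final exploration rate
    eps_decay_steps: int = 500_000 # steps over which epsilon decays linearly
    epochs: int = 50               # training epochs

def epsilon_at(step: int, cfg: QLearningConfig) -> float:
    # Linear decay from eps_start to eps_end over eps_decay_steps.
    frac = min(step / cfg.eps_decay_steps, 1.0)
    return cfg.eps_start + frac * (cfg.eps_end - cfg.eps_start)

For example, epsilon_at(0, QLearningConfig()) returns 0.3, epsilon_at(250_000, QLearningConfig()) returns 0.2, and any step past 500,000 stays at 0.1.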