Opponent Modeling in Deep Reinforcement Learning

Authors: He He, Jordan Boyd-Graber, Kevin Kwok, Hal Daumé III

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants.
Researcher Affiliation | Academia | He He (HHE@UMIACS.UMD.EDU), University of Maryland, College Park, MD 20740, USA; Jordan Boyd-Graber (JORDAN.BOYD.GRABER@COLORADO.EDU), University of Colorado, Boulder, CO 80309, USA; Kevin Kwok (KKWOK@MIT.EDU), Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Hal Daumé III (HAL@UMIACS.UMD.EDU), University of Maryland, College Park, MD 20740, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and data: https://github.com/hhexiy/opponent
Open Datasets | Yes | We collect question/answer pairs and log user buzzes from Protobowl, an online multi-player quizbowl application. Additionally, we include data from Boyd-Graber et al. (2012).
Dataset Splits | Yes | We divide all questions into two non-overlapping sets: one for training the content model and one for training the buzzing policy. The two sets are further divided into train/dev and train/dev/test sets randomly. (A split sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing instance specifications used for the experiments.
Software Dependencies | No | The paper mentions optimization algorithms and neural network components (e.g., AdaGrad, GRU) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | All systems are trained under the same Q-learning framework. Unless stated otherwise, the experiments have the following configuration: the discount factor γ is 0.9, parameters are optimized by AdaGrad (Duchi et al., 2011) with a learning rate of 0.0005, and the mini-batch size is 64. We use ϵ-greedy exploration during training, starting with an exploration rate of 0.3 that linearly decays to 0.1 within 500,000 steps. We train all models for fifty epochs. Cross-entropy is used as the loss in multitask learning. (A configuration sketch follows the table.)
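
The Dataset Splits row describes a two-stage split: questions are first divided into two non-overlapping pools (one for the content model, one for the buzzing policy), and each pool is then split randomly. The short Python sketch below illustrates that procedure; it is not the authors' code, and the split ratios, function name, and dictionary keys are assumptions.

import random

# Hypothetical illustration of the two-stage split described in the
# Dataset Splits row; only the two-pool structure and the train/dev vs.
# train/dev/test breakdown come from the paper, the ratios are assumed.
def split_questions(questions, seed=0):
    rng = random.Random(seed)
    questions = list(questions)
    rng.shuffle(questions)

    half = len(questions) // 2
    content_pool = questions[:half]   # pool for training the content model
    policy_pool = questions[half:]    # pool for training the buzzing policy

    # Content pool: train/dev (assumed 90/10 split).
    n_dev = len(content_pool) // 10
    content = {"dev": content_pool[:n_dev], "train": content_pool[n_dev:]}

    # Policy pool: train/dev/test (assumed 80/10/10 split).
    n = len(policy_pool)
    policy = {
        "test": policy_pool[: n // 10],
        "dev": policy_pool[n // 10 : n // 5],
        "train": policy_pool[n // 5 :],
    }
    return content, policy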
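
The Experiment Setup row gives concrete hyperparameters. The sketch below collects them into a configuration object and shows the linear ϵ-decay schedule; the class and helper names are assumptions, and only the numeric values are taken from the quoted setup.

from dataclasses import dataclass

# Hypothetical container for the quoted Q-learning settings; the numbers
# come from the Experiment Setup row, everything else is an assumption.
@dataclass
class QLearningConfig:
    gamma: float = 0.9             # discount factor
    learning_rate: float = 0.0005  # AdaGrad step size
    batch_size: int = 64           # mini-batch size
    eps_start: float = 0.3         # initial exploration rate
    eps_end: float = 0.1           # final exploration rate
    eps_decay_steps: int = 500_000 # steps over which epsilon decays linearly
    epochs: int = 50               # training epochs

def epsilon_at(step: int, cfg: QLearningConfig) -> float:
    # Linear decay from eps_start to eps_end over eps_decay_steps.
    frac = min(step / cfg.eps_decay_steps, 1.0)
    return cfg.eps_start + frac * (cfg.eps_end - cfg.eps_start)

For example, epsilon_at(0, QLearningConfig()) returns 0.3, epsilon_at(250_000, QLearningConfig()) returns 0.2, and any step past 500,000 stays at 0.1.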