Opponent Modeling in Deep Reinforcement Learning
Authors: He He, Jordan Boyd-Graber, Kevin Kwok, Hal Daumé III
ICML 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our models on a simulated soccer game and a popular trivia game, showing superior performance over DQN and its variants. |
| Researcher Affiliation | Academia | He He HHE@UMIACS.UMD.EDU University of Maryland, College Park, MD 20740 USA Jordan Boyd-Graber JORDAN.BOYD.GRABER@COLORADO.EDU University of Colorado, Boulder, CO 80309 USA Kevin Kwok KKWOK@MIT.EDU Massachusetts Institute of Technology, Cambridge, MA 02139 USA Hal Daumé III HAL@UMIACS.UMD.EDU University of Maryland, College Park, MD 20740 USA |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data: https://github.com/hhexiy/opponent |
| Open Datasets | Yes | We collect question/answer pairs and log user buzzes from Protobowl, an online multi-player quizbowl application. Additionally, we include data from Boyd-Graber et al. (2012). |
| Dataset Splits | Yes | We divide all questions into two nonoverlapping sets: one for training the content model and one for training the buzzing policy. The two sets are further divided into train/dev and train/dev/test sets randomly. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or cloud computing instance specifications used for experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms and neural network components (e.g., AdaGrad, GRU) but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | All systems are trained under the same Q-learning framework. Unless stated otherwise, the experiments have the following configuration: discount factor γ is 0.9, parameters are optimized by AdaGrad (Duchi et al., 2011) with a learning rate of 0.0005, and the mini-batch size is 64. We use ϵ-greedy exploration during training, starting with an exploration rate of 0.3 that linearly decays to 0.1 within 500,000 steps. We train all models for fifty epochs. Cross entropy is used as the loss in multitask learning. A hedged sketch of this configuration appears below the table. |
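The quoted experiment setup amounts to a small set of training hyperparameters plus a linear ϵ-decay schedule. The sketch below is a minimal, hypothetical restatement of those numbers in code; the class and function names (`TrainConfig`, `epsilon_at`) are illustrative and are not taken from the authors' released repository (https://github.com/hhexiy/opponent).

```python
# Hedged sketch of the reported training configuration, assuming the
# hyperparameters quoted in the "Experiment Setup" row above.
# Names here are illustrative, not from the authors' code.

from dataclasses import dataclass


@dataclass
class TrainConfig:
    gamma: float = 0.9             # discount factor
    learning_rate: float = 0.0005  # AdaGrad step size
    batch_size: int = 64           # mini-batch size
    eps_start: float = 0.3         # initial exploration rate
    eps_end: float = 0.1           # final exploration rate
    eps_decay_steps: int = 500_000 # linear decay horizon
    num_epochs: int = 50           # training epochs


def epsilon_at(step: int, cfg: TrainConfig) -> float:
    """Linearly decay epsilon from eps_start to eps_end over eps_decay_steps."""
    frac = min(step / cfg.eps_decay_steps, 1.0)
    return cfg.eps_start + frac * (cfg.eps_end - cfg.eps_start)


if __name__ == "__main__":
    cfg = TrainConfig()
    for step in (0, 250_000, 500_000, 1_000_000):
        print(step, round(epsilon_at(step, cfg), 3))
    # 0 -> 0.3, 250000 -> 0.2, 500000 -> 0.1, then held at 0.1
```

The decay clamps at ϵ = 0.1 after 500,000 steps, matching the paper's description of a linear anneal followed by a constant exploration rate.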