A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

Authors: Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, Changjie Fan

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents empirical results on a gridworld game adapted from [8], a navigation game adapted from [3], and a soccer game adapted from [6, 15]. Comparisons among BPR [20], BPR+ [10], and deep BPR+ are performed to verify their performance. In all games involving a non-stationary agent, deep BPR+ is empirically evaluated in terms of detection accuracy, cumulative reward, and the learning speed of a new response policy.
Researcher Affiliation | Collaboration | Yan Zheng (1), Zhaopeng Meng (1), Jianye Hao (1), Zongzhang Zhang (2), Tianpei Yang (1), Changjie Fan (3). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) School of Computer Science and Technology, Soochow University, Suzhou, China; (3) NetEase Fuxi Lab, NetEase, Inc., Hangzhou, China
Pseudocode | Yes | Algorithm 1: Deep BPR+
Input: episodes K, policy library Π, known opponent policy set T, performance model P(U | T, Π)
1  Initialize beliefs β^0 with a uniform distribution
2  for episode t = 1 ... K do
3    if executing a reuse stage then
4      Choose a policy π^t based on β^(t-1) to execute, and receive utility u^t (see Equation 11)
5      Estimate the opponent's online policy τ̂^t_o based on its observed behaviors
6      Update the rectified belief model β^t using u^t and τ̂^t_o (see Equation 10)
7      if a new opponent policy is detected by the moving-average reward then
8        Initialize policy π^t from the distilled policy network, then switch to the learning stage
9    else if executing a learning stage then
10     Optimize π^t by DQN, and estimate the opponent's online policy τ̂^t_o
11     if an optimal policy is obtained then
12       Update T, Π, and P(U | T, Π), then switch to the reuse stage
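Because the main text defers the network architecture and hyperparameters to the appendix, Algorithm 1 is the most concrete specification of the method reproduced here. The Python sketch below illustrates only the reuse-stage control flow of that pseudocode under stated assumptions: the environment, the DQN learner, the policy-distillation network, and the exact forms of Equations 10 and 11 are not given in this section, so the Gaussian reward likelihood, the opponent-likelihood vector, and the moving-average detection threshold are illustrative stand-ins, not the authors' implementation.

# Minimal sketch of the Deep BPR+ reuse-stage loop (Algorithm 1, lines 3-8).
# The likelihood models and detection rule below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

class DeepBPRPlusSketch:
    def __init__(self, policies, opponents, perf_model, window=10, drop_threshold=0.5):
        self.policies = policies            # policy library Pi (identifiers or callables)
        self.opponents = opponents          # known opponent policy set T (labels)
        self.perf_model = perf_model        # stand-in for P(U | T, Pi): mean-utility table
        self.belief = np.full(len(opponents), 1.0 / len(opponents))  # beta^0, uniform
        self.window, self.drop_threshold = window, drop_threshold
        self.recent_rewards = []

    def select_policy(self):
        # Line 4: pick the response policy with the highest expected utility
        # under the current belief over opponent policies.
        expected = self.belief @ self.perf_model      # shape: (num_policies,)
        return int(np.argmax(expected))

    def update_belief(self, policy_idx, utility, opponent_likelihood):
        # Lines 5-6: rectified belief update.  Here the reward likelihood is a
        # Gaussian around the stored mean utility, multiplied by the likelihood
        # of the observed opponent behaviour; Equation 10 may differ in detail.
        reward_likelihood = np.exp(-0.5 * (utility - self.perf_model[:, policy_idx]) ** 2)
        posterior = self.belief * reward_likelihood * opponent_likelihood
        self.belief = posterior / posterior.sum()

    def new_opponent_detected(self, utility):
        # Line 7: flag a new opponent policy when the moving-average reward falls
        # well below the best utility predicted by the performance model.
        self.recent_rewards.append(utility)
        if len(self.recent_rewards) < self.window:
            return False
        moving_avg = np.mean(self.recent_rewards[-self.window:])
        return moving_avg < self.drop_threshold * self.perf_model.max()

# Toy usage: two known opponent policies, two response policies.
perf_model = np.array([[1.0, 0.2],      # rows: opponent policies, cols: response policies
                       [0.1, 0.9]])
agent = DeepBPRPlusSketch(policies=["pi_0", "pi_1"], opponents=["tau_0", "tau_1"],
                          perf_model=perf_model)
for episode in range(20):
    pi = agent.select_policy()
    utility = perf_model[0, pi] + rng.normal(0, 0.05)   # opponent secretly plays tau_0
    opp_likelihood = np.array([0.8, 0.2])                # stand-in for the tau-hat estimate
    agent.update_belief(pi, utility, opp_likelihood)
    if agent.new_opponent_detected(utility):
        print(f"episode {episode}: switch to learning stage (DQN + policy distillation)")
        break
print("final belief over opponents:", np.round(agent.belief, 3))

In this toy run the belief concentrates on the opponent policy whose predicted utilities match the observed rewards; a sustained drop in the moving-average reward would instead trigger the switch to the learning stage in line 8 of Algorithm 1.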
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | No | The paper uses adapted versions of gridworld, navigation, and soccer games as environments for experiments, but does not specify a publicly available dataset, nor does it provide a link or formal citation to one. Data is implicitly generated during interaction with these environments.
Dataset Splits | No | The paper conducts experiments in simulated game environments rather than on traditional datasets with explicit training, validation, and test splits described by percentages or sample counts.
Hardware Specification | No | The paper does not explicitly provide details about the specific hardware (e.g., GPU/CPU models, memory specifications) used to run the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or solvers with their versions) that would be needed to replicate the experiment.
Experiment Setup | No | The paper states 'Detailed network architecture of deep BPR+ and corresponding hyperparameters are described in Appendix.', indicating that these specific details are not present in the main body of the paper.