A Deep Bayesian Policy Reuse Approach Against Non-Stationary Agents

Authors: Yan Zheng, Zhaopeng Meng, Jianye Hao, Zongzhang Zhang, Tianpei Yang, Changjie Fan

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This section presents empirical results on a gridworld game adapted from [8], a navigation game adapted from [3], and a soccer game adapted from [6, 15]. Comparisons among BPR [20], BPR+ [10], and deep BPR+ are performed to verify their performance. In all games involving a non-stationary agent, deep BPR+ is empirically evaluated in terms of detection accuracy, cumulative reward, and the learning speed of a new response policy.
Researcher Affiliation | Collaboration | Yan Zheng (1), Zhaopeng Meng (1), Jianye Hao (1), Zongzhang Zhang (2), Tianpei Yang (1), Changjie Fan (3). (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) School of Computer Science and Technology, Soochow University, Suzhou, China; (3) NetEase Fuxi Lab, NetEase, Inc., Hangzhou, China
Pseudocode | Yes | Algorithm 1: Deep BPR+
Input: episodes K, policy library Π, known opponent policy set T, performance model P(U | T, Π)
1  Initialize beliefs β^0 with a uniform distribution
2  for episode t = 1 ... K do
3    if executing a reuse stage then
4      Choose a policy π^t based on β^(t-1) to execute, and receive utility u^t (see Equation 11)
5      Estimate the opponent's online policy τ̂^t_o based on its observed behaviors
6      Update the rectified belief model β^t using u^t and τ̂^t_o (see Equation 10)
7      if a new opponent policy is detected by the moving-average reward then
8        Initialize policy π^t from the distilled policy network, then switch to the learning stage
9    else if executing a learning stage then
10     Optimize π^t by DQN, and estimate the opponent's online policy τ̂^t_o
11     if an optimal policy is obtained then
12       Update T, Π, and P(U | T, Π), then switch to the reuse stage
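Because the main text defers the network architecture and hyperparameters to the appendix, Algorithm 1 is the most concrete specification of the method reproduced here. The Python sketch below illustrates only the reuse-stage control flow of that pseudocode under stated assumptions: the environment, the DQN learner, the policy-distillation network, and the exact forms of Equations 10 and 11 are not given in this section, so the Gaussian reward likelihood, the opponent-likelihood vector, and the moving-average detection threshold are illustrative stand-ins, not the authors' implementation.

# Minimal sketch of the Deep BPR+ reuse-stage loop (Algorithm 1, lines 3-8).
# The likelihood models and detection rule below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

class DeepBPRPlusSketch:
    def __init__(self, policies, opponents, perf_model, window=10, drop_threshold=0.5):
        self.policies = policies            # policy library Pi (identifiers or callables)
        self.opponents = opponents          # known opponent policy set T (labels)
        self.perf_model = perf_model        # stand-in for P(U | T, Pi): mean-utility table
        self.belief = np.full(len(opponents), 1.0 / len(opponents))  # beta^0, uniform
        self.window, self.drop_threshold = window, drop_threshold
        self.recent_rewards = []

    def select_policy(self):
        # Line 4: pick the response policy with the highest expected utility
        # under the current belief over opponent policies.
        expected = self.belief @ self.perf_model      # shape: (num_policies,)
        return int(np.argmax(expected))

    def update_belief(self, policy_idx, utility, opponent_likelihood):
        # Lines 5-6: rectified belief update.  Here the reward likelihood is a
        # Gaussian around the stored mean utility, multiplied by the likelihood
        # of the observed opponent behaviour; Equation 10 may differ in detail.
        reward_likelihood = np.exp(-0.5 * (utility - self.perf_model[:, policy_idx]) ** 2)
        posterior = self.belief * reward_likelihood * opponent_likelihood
        self.belief = posterior / posterior.sum()

    def new_opponent_detected(self, utility):
        # Line 7: flag a new opponent policy when the moving-average reward falls
        # well below the best utility predicted by the performance model.
        self.recent_rewards.append(utility)
        if len(self.recent_rewards) < self.window:
            return False
        moving_avg = np.mean(self.recent_rewards[-self.window:])
        return moving_avg < self.drop_threshold * self.perf_model.max()

# Toy usage: two known opponent policies, two response policies.
perf_model = np.array([[1.0, 0.2],      # rows: opponent policies, cols: response policies
                       [0.1, 0.9]])
agent = DeepBPRPlusSketch(policies=["pi_0", "pi_1"], opponents=["tau_0", "tau_1"],
                          perf_model=perf_model)
for episode in range(20):
    pi = agent.select_policy()
    utility = perf_model[0, pi] + rng.normal(0, 0.05)   # opponent secretly plays tau_0
    opp_likelihood = np.array([0.8, 0.2])                # stand-in for the tau-hat estimate
    agent.update_belief(pi, utility, opp_likelihood)
    if agent.new_opponent_detected(utility):
        print(f"episode {episode}: switch to learning stage (DQN + policy distillation)")
        break
print("final belief over opponents:", np.round(agent.belief, 3))

In this toy run the belief concentrates on the opponent policy whose predicted utilities match the observed rewards; a sustained drop in the moving-average reward would instead trigger the switch to the learning stage in line 8 of Algorithm 1.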
Open Source Code | No | The paper does not provide an explicit statement about the release of source code or a link to a code repository.
Open Datasets | No | The paper uses adapted versions of gridworld, navigation, and soccer games as environments for experiments, but does not specify a publicly available dataset, nor does it provide a link or formal citation to one. Data is implicitly generated during interaction with these environments.
Dataset Splits | No | The paper conducts experiments in simulated game environments rather than on traditional datasets with explicit training, validation, and test splits described by percentages or sample counts.
Hardware Specification | No | The paper does not explicitly provide details about the specific hardware (e.g., GPU/CPU models, memory specifications) used to run the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., programming languages, libraries, or solvers with their versions) that would be needed to replicate the experiment.
Experiment Setup | No | The paper states 'Detailed network architecture of deep BPR+ and corresponding hyperparameters are described in Appendix.', indicating that these specific details are not present in the main body of the paper.