Payoff Control in the Iterated Prisoner's Dilemma

Authors: Dong Hao, Kai Li, Tao Zhou

IJCAI 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We simulate several specific strategies generated under the payoff control framework in a tournament similar to that of Axelrod [Axelrod and Hamilton, 1981], and it is found that the new payoff control strategies perform remarkably well. To analyze the performance of the payoff control strategies when confronting various famous strategies, in Section 5 we simulate the control strategies in Axelrod's tournament. In the last section, to evaluate how payoff control strategies perform in the real world, we simulate them against a reinforcement learning player [Sutton and Barto, 1998].
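As an illustration only (not code from the paper), a round-robin tournament of this kind can be set up with the open-source Axelrod-Python library. The strategy roster below and the use of this library are assumptions; the paper's payoff control strategies are not in the library and would have to be implemented separately (e.g. as memory-one players).

import axelrod as axl

# Assumed IPD payoffs (R, T, S, P) = (2, 3, 1, 0) from the paper; axl.Game takes (r, s, t, p).
game = axl.Game(r=2, s=1, t=3, p=0)

# A small stand-in roster of well-known strategies (not the paper's exact roster).
players = [axl.TitForTat(), axl.Cooperator(), axl.Defector(), axl.Grudger(), axl.Random()]

# Round-robin tournament with 200-stage matches, repeated 1000 times.
tournament = axl.Tournament(players, game=game, turns=200, repetitions=1000)
results = tournament.play()
print(results.ranked_names)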
Researcher Affiliation Academia 1 University of Electronic Science and Technology of China, Chengdu, China 2 Shanghai Jiao Tong University, Shanghai, China haodong@uestc.edu.cn, kai.li@sjtu.edu.cn, zhutou@ustc.edu
Pseudocode No The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code No The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets No The paper describes simulations and uses a traditional prisoner's dilemma payoff matrix setting, but it does not use a publicly available or open dataset in the conventional sense, nor does it provide access information for any dataset.
Dataset Splits No The paper describes simulation parameters such as the number of repetitions and stages for the tournament, but it does not provide details about training, validation, or test dataset splits, which are typically found in machine learning experiments.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models) used to run the simulations or experiments.
Software Dependencies No The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or solvers).
Experiment Setup Yes The simulated tournament is similar to that in [Stewart and Plotkin, 2012] but uses a different IPD setting, (R, T, S, P) = (2, 3, 1, 0). Due to the inherent stochasticity of some strategies, the tournament is repeated 1000 times. In a tournament, each strategy in the above set meets every other strategy (including itself) in a perfect iterated prisoner's dilemma (IPD) game, and each IPD game has 200 stages. Y's strategy q is updated according to the following average-reward value function: Q(ω, a) ← (1 − α) Q(ω, a) + α [ Δr + max_{a′} Q(ω′, a′) ] (22), where Q(ω, a) is an evaluation value of player Y choosing action a after stage-game outcome ω, and Δr = r(ω, a, ω′) − r̄ is the difference between the instantaneous reward r and the estimated average reward r̄. The instantaneous reward r(ω, a, ω′) is induced by player Y taking action a after outcome ω and transitioning the game to a new outcome ω′. α is the learning rate.
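A minimal Python sketch of the average-reward Q-learning update in Eq. (22) follows. It is an illustration only: the ε-greedy action selection, the initial outcome, and the running-mean update of the average reward r̄ are assumptions not specified in the quoted text, and the opponent is a hypothetical fixed strategy rather than a payoff control strategy from the paper.

import random
from collections import defaultdict

ACTIONS = ("C", "D")        # cooperate / defect
R, T, S, P = 2, 3, 1, 0     # IPD payoffs used in the paper's tournament setting

# Player Y's stage payoff for an outcome (x, y) = (X's action, Y's action).
PAYOFF_Y = {("C", "C"): R, ("C", "D"): T, ("D", "C"): S, ("D", "D"): P}

def q_learning_player(opponent, stages=200, alpha=0.1, epsilon=0.1):
    """Average-reward Q-learning for player Y against a fixed opponent strategy.

    `opponent` maps the previous outcome to X's next action, e.g. tit-for-tat.
    """
    Q = defaultdict(float)          # Q[(omega, a)]: value of action a after outcome omega
    r_bar, n = 0.0, 0               # estimated average reward (assumed running-mean update)
    omega = ("C", "C")              # assumed initial outcome
    for _ in range(stages):
        # epsilon-greedy action choice (exploration rule is an assumption)
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(omega, act)])
        x = opponent(omega)                      # opponent X reacts to the last outcome
        omega_next = (x, a)
        r = PAYOFF_Y[omega_next]                 # instantaneous reward r(omega, a, omega')
        n += 1
        r_bar += (r - r_bar) / n                 # update the estimated average reward
        # Eq. (22): Q(omega, a) <- (1 - alpha) Q(omega, a) + alpha [ Delta_r + max_a' Q(omega', a') ]
        best_next = max(Q[(omega_next, act)] for act in ACTIONS)
        Q[(omega, a)] = (1 - alpha) * Q[(omega, a)] + alpha * ((r - r_bar) + best_next)
        omega = omega_next
    return Q

# Example: learn against tit-for-tat, where X repeats Y's last move.
tit_for_tat = lambda omega: omega[1]
Q = q_learning_player(tit_for_tat, stages=200)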