Payoff Control in the Iterated Prisoner's Dilemma
Authors: Dong Hao, Kai Li, Tao Zhou
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We simulate several specific strategies generated under the payoff control framework in a tournament similar to that of Axelrod [Axelrod and Hamilton, 1981]; the new payoff control strategies are found to perform remarkably well. To analyze how the payoff control strategies perform when confronting various famous strategies, in Section 5 we simulate the control strategies in Axelrod's tournament. In the last section, to evaluate how payoff control strategies perform in the real world, we simulate them against a reinforcement learning player [Sutton and Barto, 1998]. |
| Researcher Affiliation | Academia | (1) University of Electronic Science and Technology of China, Chengdu, China; (2) Shanghai Jiao Tong University, Shanghai, China. haodong@uestc.edu.cn, kai.li@sjtu.edu.cn, zhutou@ustc.edu |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes simulations and uses a traditional prisoner's dilemma payoff matrix setting, but it does not use a publicly available or open dataset in the conventional sense, nor does it provide access information for any dataset. |
| Dataset Splits | No | The paper describes simulation parameters such as the number of repetitions and stages for the tournament, but it does not provide details about training, validation, or test dataset splits, which are typically found in machine learning experiments. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models) used to run the simulations or experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, or solvers). |
| Experiment Setup | Yes | The simulated tournament is similar to that in [Stewart and Plotkin, 2012] but uses a different IPD setting (R, T, S, P) = (2, 3, 1, 0). Due to the inherent stochasticity of some strategies, the tournament is repeated 1000 times. In a tournament, each strategy in the above set meets each other (including itself) in a perfect iterated prisoner's dilemma (IPD) game, and each IPD game has 200 stages. Y's strategy q is updated according to the following average-reward value function: Q(ω, a) ← (1 − α)·Q(ω, a) + α·[ r̃ + max_{a′} Q(ω′, a′) ] (22), where Q(ω, a) is an evaluation value of player Y choosing action a after stage-game outcome ω, and r̃ = r(ω, a, ω′) − r̄ is the difference between the instantaneous reward r and the estimated average reward r̄. The instantaneous reward r(ω, a, ω′) is induced by player Y taking action a after outcome ω and transitioning the game to a new outcome ω′. α is a free variable for the learning rate. |
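
For context on the tournament described in the Research Type and Experiment Setup rows, below is a minimal sketch of one 200-stage IPD game under the payoff setting (R, T, S, P) = (2, 3, 1, 0), assuming memory-one strategies given as cooperation probabilities conditioned on the previous outcome. The tit-for-tat and always-defect entries are illustrative placeholders, not the paper's payoff control strategies.

```python
import random

R, T, S, P = 2, 3, 1, 0                      # payoff setting used in the paper's tournament
PAYOFF = {                                   # (my move, opponent's move) -> my stage payoff
    ("C", "C"): R, ("C", "D"): S,
    ("D", "C"): T, ("D", "D"): P,
}

def play_ipd(p, q, stages=200):
    """Play one IPD game of `stages` rounds between memory-one strategies p and q.

    p and q map the previous outcome, written from that player's own view
    ('CC', 'CD', 'DC', 'DD'), to the probability of cooperating next round;
    both players are assumed to cooperate in the first round.
    """
    x, y = "C", "C"
    total_x = total_y = 0
    for _ in range(stages):
        total_x += PAYOFF[(x, y)]
        total_y += PAYOFF[(y, x)]
        x, y = ("C" if random.random() < p[x + y] else "D",
                "C" if random.random() < q[y + x] else "D")
    return total_x / stages, total_y / stages

# Illustrative pairing: tit-for-tat vs. always-defect, averaged over 1000
# repetitions as in the reported tournament setup.
tft = {"CC": 1.0, "CD": 0.0, "DC": 1.0, "DD": 0.0}
alld = {"CC": 0.0, "CD": 0.0, "DC": 0.0, "DD": 0.0}
runs = [play_ipd(tft, alld) for _ in range(1000)]
print(sum(r[0] for r in runs) / len(runs), sum(r[1] for r in runs) / len(runs))
```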
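The average-reward value-function update in Eq. (22), quoted in the Experiment Setup row, can be sketched as follows. The epsilon-greedy action selection, the step size for the average-reward estimate, and all parameter values are assumptions made for illustration and are not specified in the quoted text.

```python
import random

OUTCOMES = ["CC", "CD", "DC", "DD"]   # previous stage-game outcome ω, from Y's view
ACTIONS = ["C", "D"]                  # Y's actions

Q = {(w, a): 0.0 for w in OUTCOMES for a in ACTIONS}
r_bar = 0.0      # estimated average reward (r̄)
alpha = 0.1      # learning rate (a free parameter in the quoted setup)
beta = 0.01      # assumed step size for updating the average-reward estimate
eps = 0.1        # assumed exploration rate for action selection

def choose_action(w):
    """Epsilon-greedy choice of Y's action after outcome w (an assumption, not from the paper)."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(w, a)])

def update(w, a, r, w_next):
    """Eq. (22): Q(ω,a) ← (1−α)·Q(ω,a) + α·[ r̃ + max_{a′} Q(ω′,a′) ], with r̃ = r − r̄."""
    global r_bar
    r_tilde = r - r_bar                                  # difference from the estimated average reward
    target = r_tilde + max(Q[(w_next, a2)] for a2 in ACTIONS)
    Q[(w, a)] = (1 - alpha) * Q[(w, a)] + alpha * target
    r_bar += beta * (r - r_bar)                          # maintain the running average-reward estimate
```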