Balancing Two-Player Stochastic Games with Soft Q-Learning
Authors: Jordi Grau-Moya, Felix Leibfried, Haitham Bou-Ammar
IJCAI 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We contribute both theoretically and empirically. On the theory side, we show that games with soft Q-learning exhibit a unique value and generalise team games and zero-sum games far beyond these two extremes to cover a continuous spectrum of gaming behaviour. Experimentally, we show how tuning agents' constraints affects performance and demonstrate, through a neural network architecture, how to reliably balance games with high-dimensional representations. |
| Researcher Affiliation | Industry | Jordi Grau-Moya, Felix Leibfried and Haitham Bou-Ammar, PROWLER.io (jordi@prowler.io, felix@prowler.io, haitham@prowler.io) |
| Pseudocode | Yes | Algorithm 1 Two-Player Soft Q-Learning |
| Open Source Code | No | The paper provides a link for 'Game-play videos' but not for the source code of their methodology. 'Game-play videos can be found at https://sites.google.com/site/submission3591/.' |
| Open Datasets | No | The paper mentions a '5x6 grid-world' and 'the game Pong from the Roboschool package' as experimental environments, but since agents are trained directly in these simulators, no dataset with access information (link or citation) is provided. |
| Dataset Splits | No | The paper describes training in environments (grid-world and Pong) rather than using predefined datasets with explicit train/validation/test splits. No specific data splits were provided for reproducibility. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) were mentioned. |
| Software Dependencies | No | The paper mentions using the 'Roboschool package' and 'ADAM optimizer' but does not specify their version numbers or any other software dependencies with version details. |
| Experiment Setup | Yes | For the low-dimensional experiments: 'For all experiments we used a high learning rate of α = 0.5'. For the high-dimensional Pong experiments: 'We used a learning rate of 10^-4, the ADAM optimizer, a batch size of 32, and updated the target every 30000 training steps.' (A hedged code sketch using these quoted values follows this table.) |
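Since no source code is released, the sketch below is only an illustration of how the quoted Pong hyperparameters (learning rate 10^-4, ADAM, batch size 32, target update every 30000 steps) could be wired into a soft Q-learning update built on the log-sum-exp Bellman backup. It is a minimal single-agent sketch in PyTorch, not the paper's two-player Algorithm 1; the network sizes, discount factor `GAMMA`, inverse-temperature `BETA`, and the function name `soft_q_update` are illustrative assumptions.

```python
# Minimal soft Q-learning sketch (assumed setup; not the authors' code).
import copy
import torch
import torch.nn as nn

GAMMA = 0.99           # discount factor (assumed; not quoted in the paper)
BETA = 1.0             # inverse temperature / constraint parameter (assumed)
LEARNING_RATE = 1e-4   # quoted: "a learning rate of 10^-4"
BATCH_SIZE = 32        # quoted: "a batch size of 32"
TARGET_UPDATE = 30000  # quoted: "updated the target every 30000 training steps"

# Toy network sizes for illustration only.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = copy.deepcopy(q_net)
optimizer = torch.optim.Adam(q_net.parameters(), lr=LEARNING_RATE)  # "the ADAM optimizer"

def soft_q_update(batch, step):
    """One gradient step on a replay batch of (s, a, r, s_next, done) tensors."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Soft state value: V(s') = (1 / beta) * log sum_a' exp(beta * Q_target(s', a'))
        v_next = torch.logsumexp(BETA * target_net(s_next), dim=1) / BETA
        target = r + GAMMA * (1.0 - done) * v_next
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Periodically sync the target network, as in the quoted setup.
    if step % TARGET_UPDATE == 0:
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```

The paper's Algorithm 1 additionally couples two agents whose constraint (temperature) parameters are tuned to balance the game; reproducing that would require the two-player value definition from the paper, which this single-agent sketch does not attempt.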