Robust Market Making via Adversarial Reinforcement Learning
Authors: Thomas Spooner, Rahul Savani
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically compare two conventional single-agent RL agents with ARL, and show that our ARL approach leads to: 1) the emergence of risk-averse behaviour without constraints or domain-specific penalties; 2) significant improvements in performance across a set of standard metrics, evaluated with or without an adversary in the test environment; and 3) improved robustness to model uncertainty. We empirically demonstrate that our ARL method consistently converges, and we prove for several special cases that the profiles that we converge to correspond to Nash equilibria in a simplified single-stage game. |
| Researcher Affiliation | Academia | Thomas Spooner and Rahul Savani Department of Computer Science, University of Liverpool {t.spooner, rahul.savani}@liverpool.ac.uk |
| Pseudocode | No | The paper describes algorithms such as 'NAC-S(λ)' and adaptations of 'RARL', but it does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Software. All our code is freely accessible on GitHub: https://github.com/tspooner/rmm.arl. |
| Open Datasets | No | The paper explicitly states it uses an analytical model rather than a data-driven approach: 'Using an analytical model allows us to examine the characteristics of adversarial training in isolation while minimising systematic error due to bias often present in historical data.' Therefore, no publicly available or open dataset is used. |
| Dataset Splits | No | The paper describes simulation parameters and training duration (e.g., 'value function was pre-trained for 1000 episodes', 'trained for 10^6 episodes'), but it does not specify traditional dataset splits (e.g., train/validation/test percentages or sample counts) as it utilizes a simulation model to generate data on the fly rather than using a fixed dataset. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU specifications, or memory used for running the experiments. It only describes the software and training setup. |
| Software Dependencies | No | The paper mentions using 'NAC-S(λ) algorithm' and 'semi-gradient SARSA(λ)' for policy evaluation, but it does not list any specific software packages, libraries, or solvers with version numbers (e.g., Python 3.8, PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | In each of the experiments to follow, the value function was pre-trained for 1000 episodes (with a learning rate of 10^-3) to reduce variance in early policy updates. Both the value function and policy were then trained for 10^6 episodes, with policy updates every 100 time steps, and a learning rate of 10^-4 for both the critic and policy. The value function was configured to learn λ = 0.97 returns. The starting time was chosen uniformly at random from the interval t0 ∈ [0.0, 0.95], with starting price Z0 = 100 and inventory H0 ∈ [H̲ = -50, H̄ = 50]. Innovations in Zn occurred with fixed volatility σ = 2 between [t0, 1] with increment Δt = 0.005. |
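
The Experiment Setup row pins down the simulation and training constants but not the code that uses them. The sketch below simply collects those constants and draws one episode's initial conditions and mid-price path. The arithmetic Brownian increment Z_{n+1} = Z_n + σ·sqrt(Δt)·ξ_n and the uniform draw of the initial inventory from its bounds are assumptions about how "fixed volatility σ = 2" and the interval for H0 are used; they are not confirmed by the excerpt, and all names here are illustrative.

```python
import numpy as np

# Constants quoted in the paper's experiment setup (see the table row above).
Z0 = 100.0                    # starting mid-price
SIGMA = 2.0                   # fixed volatility of mid-price innovations
DT = 0.005                    # time increment Delta-t
T0_LOW, T0_HIGH = 0.0, 0.95   # starting time drawn uniformly from [0.0, 0.95]
H_MIN, H_MAX = -50, 50        # bounds on the initial inventory H0

PRETRAIN_EPISODES = 1_000     # value-function pre-training episodes
PRETRAIN_LR = 1e-3            # pre-training learning rate
TRAIN_EPISODES = 10**6        # main training episodes
POLICY_UPDATE_EVERY = 100     # time steps between policy updates
CRITIC_LR = POLICY_LR = 1e-4  # learning rates for critic and policy
LAMBDA = 0.97                 # lambda used for the critic's lambda-returns


def sample_episode(rng: np.random.Generator):
    """Draw one episode's initial conditions and a mid-price path on [t0, 1].

    Assumptions (not stated in the excerpt): the initial inventory is drawn
    uniformly from its bounds, and mid-price innovations are arithmetic
    Brownian increments Z_{n+1} = Z_n + sigma * sqrt(dt) * xi_n.
    """
    t0 = rng.uniform(T0_LOW, T0_HIGH)
    h0 = int(rng.integers(H_MIN, H_MAX + 1))
    n_steps = int(round((1.0 - t0) / DT))
    increments = SIGMA * np.sqrt(DT) * rng.standard_normal(n_steps)
    z_path = Z0 + np.concatenate(([0.0], np.cumsum(increments)))
    return t0, h0, z_path


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t0, h0, z_path = sample_episode(rng)
    print(f"t0={t0:.3f}, H0={h0}, steps={len(z_path) - 1}, Z_1={z_path[-1]:.2f}")
```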
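The setup also characterises the critic: a value function learning λ = 0.97 returns with a 10^-4 learning rate, evaluated with semi-gradient SARSA(λ) inside NAC-S(λ) (see the Software Dependencies row). Since the paper provides no pseudocode (see the Pseudocode row), the following is only a generic, minimal linear semi-gradient SARSA(λ) critic with accumulating traces, not the authors' implementation; the class and method names are illustrative and the natural-gradient policy update of NAC-S(λ) is omitted entirely.

```python
import numpy as np


class SarsaLambdaCritic:
    """Linear semi-gradient SARSA(lambda) critic with accumulating traces.

    Generic textbook sketch only: the paper names semi-gradient SARSA(lambda)
    as its policy-evaluation method but gives no pseudocode, so this class is
    illustrative rather than a reconstruction of the authors' code.
    """

    def __init__(self, n_features, alpha=1e-4, gamma=1.0, lam=0.97):
        self.w = np.zeros(n_features)   # weights of the linear value estimate
        self.z = np.zeros(n_features)   # eligibility trace
        self.alpha, self.gamma, self.lam = alpha, gamma, lam

    def start_episode(self):
        """Reset the eligibility trace at the start of each episode."""
        self.z[:] = 0.0

    def q(self, phi):
        """Value estimate for a state-action feature vector phi(s, a)."""
        return float(self.w @ phi)

    def update(self, phi, reward, phi_next=None):
        """Apply one on-policy transition; pass phi_next=None at episode end."""
        target = reward if phi_next is None else reward + self.gamma * self.q(phi_next)
        delta = target - self.q(phi)                    # TD error
        self.z = self.gamma * self.lam * self.z + phi   # accumulating traces
        self.w += self.alpha * delta * self.z           # semi-gradient step
        return delta
```

With the quoted settings, such a critic would be constructed as `SarsaLambdaCritic(n_features, alpha=1e-4, lam=0.97)` and its trace reset via `start_episode()` before each training episode.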