Robust Market Making via Adversarial Reinforcement Learning

Authors: Thomas Spooner, Rahul Savani

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically compare two conventional single-agent RL agents with ARL, and show that our ARL approach leads to: 1) the emergence of risk-averse behaviour without constraints or domain-specific penalties; 2) significant improvements in performance across a set of standard metrics, evaluated with or without an adversary in the test environment, and; 3) improved robustness to model uncertainty. We empirically demonstrate that our ARL method consistently converges, and we prove for several special cases that the profiles that we converge to correspond to Nash equilibria in a simplified single-stage game.
Researcher Affiliation | Academia | Thomas Spooner and Rahul Savani, Department of Computer Science, University of Liverpool, {t.spooner, rahul.savani}@liverpool.ac.uk
Pseudocode | No | The paper describes algorithms such as 'NAC-S(λ)' and adaptations of 'RARL', but it does not include any pseudocode or clearly labeled algorithm blocks. (A generic sketch of an RARL-style alternating training loop is given after this table.)
Open Source Code | Yes | Software. All our code is freely accessible on GitHub: https://github.com/tspooner/rmm.arl.
Open Datasets | No | The paper explicitly states it uses an analytical model rather than a data-driven approach: 'Using an analytical model allows us to examine the characteristics of adversarial training in isolation while minimising systematic error due to bias often present in historical data.' Therefore, no publicly available or open dataset is used.
Dataset Splits | No | The paper describes simulation parameters and training duration (e.g., 'value function was pre-trained for 1000 episodes', 'trained for 10^6 episodes'), but it does not specify traditional dataset splits (e.g., train/validation/test percentages or sample counts), as it uses a simulation model to generate data on the fly rather than a fixed dataset.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU models, CPU specifications, or memory used for running the experiments. It only describes the software and training setup.
Software Dependencies | No | The paper mentions using the 'NAC-S(λ) algorithm' and 'semi-gradient SARSA(λ)' for policy evaluation, but it does not list any specific software packages, libraries, or solvers with version numbers (e.g., Python 3.8, PyTorch 1.9, TensorFlow 2.x). (A generic sketch of the semi-gradient SARSA(λ) update is given after this table.)
Experiment Setup | Yes | In each of the experiments to follow, the value function was pre-trained for 1000 episodes (with a learning rate of 10^-3) to reduce variance in early policy updates. Both the value function and policy were then trained for 10^6 episodes, with policy updates every 100 time steps, and a learning rate of 10^-4 for both the critic and policy. The value function was configured to learn λ = 0.97 returns. The starting time was chosen uniformly at random from the interval t0 ∈ [0.0, 0.95], with starting price Z0 = 100 and inventory H0 ∈ [-50, 50]. Innovations in Zn occurred with fixed volatility σ = 2 between [t0, 1] with increment Δt = 0.005. (These values are collected into a configuration sketch after this table.)
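
For concreteness, the training and simulation parameters quoted in the Experiment Setup row can be collected into a single configuration. The sketch below is not taken from the authors' repository; it only restates the reported values and, under the assumption that mid-price innovations are Gaussian increments scaled by σ√Δt (one common discretisation of this kind of analytical model), shows how an episode's price path could be generated on the fly, consistent with the Dataset Splits row.

```python
import numpy as np

# Hyperparameters as reported in the Experiment Setup row above.
CONFIG = {
    "pretrain_episodes": 1_000,      # value-function pre-training
    "pretrain_lr": 1e-3,
    "train_episodes": 1_000_000,
    "policy_update_every": 100,      # time steps between policy updates
    "critic_lr": 1e-4,
    "policy_lr": 1e-4,
    "lambda_returns": 0.97,          # λ for the value-function targets
    "t0_range": (0.0, 0.95),         # episode start time, drawn uniformly
    "z0": 100.0,                     # starting mid-price Z0
    "inventory_bounds": (-50, 50),   # initial inventory H0 range
    "sigma": 2.0,                    # mid-price volatility
    "dt": 0.005,                     # time increment Δt
}

def sample_price_path(cfg=CONFIG, rng=None):
    """Draw one mid-price path Z_n on [t0, 1].

    The Gaussian increments scaled by sigma * sqrt(dt) are an illustrative
    assumption; the excerpt only states that innovations have fixed
    volatility sigma = 2 with increment dt = 0.005.
    """
    rng = rng or np.random.default_rng()
    t0 = rng.uniform(*cfg["t0_range"])
    n_steps = int(round((1.0 - t0) / cfg["dt"]))
    increments = cfg["sigma"] * np.sqrt(cfg["dt"]) * rng.standard_normal(n_steps)
    return cfg["z0"] + np.concatenate(([0.0], np.cumsum(increments)))
```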
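
The policy-evaluation step named in the Software Dependencies row, semi-gradient SARSA(λ), has a standard textbook form with linear function approximation and accumulating eligibility traces, sketched below. This is a generic illustration rather than the authors' implementation; the feature map, the discount γ, and the function signature are assumptions, while α = 10^-4 and λ = 0.97 mirror the reported critic settings.

```python
import numpy as np

def sarsa_lambda_update(w, z, phi_sa, reward, phi_next_sa, done,
                        alpha=1e-4, gamma=1.0, lam=0.97):
    """One semi-gradient SARSA(lambda) step with linear features.

    w           : weights for the action-value estimate Q(s, a) = w . phi(s, a)
    z           : accumulating eligibility trace, same shape as w
    phi_sa      : feature vector of the current state-action pair
    phi_next_sa : features of the next state-action pair (ignored when done)
    alpha and lam match the reported critic settings; gamma is an assumption.
    """
    q = w @ phi_sa
    q_next = 0.0 if done else w @ phi_next_sa
    delta = reward + gamma * q_next - q          # TD error
    z = gamma * lam * z + phi_sa                 # accumulating trace
    w = w + alpha * delta * z                    # semi-gradient update
    return w, z
```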
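
Finally, since the Pseudocode row notes that the paper describes its adaptation of RARL only in prose, the loop below sketches the generic RARL-style alternating scheme: the market maker and the adversary are updated in turns, each treating the other's current policy as part of its environment, with the adversary receiving the negated market-maker reward (the zero-sum structure described in the paper). All interfaces here (act, update, env.reset, env.step) are hypothetical placeholders, not the authors' API.

```python
def run_episode(env, market_maker, adversary):
    """Roll out one episode with both agents acting and collect the trajectory."""
    obs, done, trajectory = env.reset(), False, []
    while not done:
        mm_action = market_maker.act(obs)
        adv_action = adversary.act(obs)
        next_obs, mm_reward, done = env.step(mm_action, adv_action)
        # Zero-sum pairing: the adversary's reward is the negated market-maker reward.
        trajectory.append((obs, mm_action, adv_action, mm_reward, -mm_reward))
        obs = next_obs
    return trajectory


def adversarial_training(env, market_maker, adversary, n_iterations, episodes_per_turn):
    """Generic RARL-style alternating training loop (illustrative sketch only)."""
    for _ in range(n_iterations):
        # 1) Improve the market maker while the adversary's policy is held fixed.
        batch = [run_episode(env, market_maker, adversary) for _ in range(episodes_per_turn)]
        market_maker.update(batch)

        # 2) Improve the adversary while the market maker's policy is held fixed.
        batch = [run_episode(env, market_maker, adversary) for _ in range(episodes_per_turn)]
        adversary.update(batch)
```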