Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Distributional Adversarial Attacks and Training in Deep Hedging

Authors: Guangyi He, Tobias Sutter, Lukas Gonon

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Additional results indicate that the robust strategies maintain reliable performance on real market data and remain effective during periods of market change. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
Researcher Affiliation	Academia	Guangyi He, Department of Mathematics, Imperial College London; Tobias Sutter, Department of Economics, University of St. Gallen; Lukas Gonon, School of Computer Science, University of St. Gallen, and Department of Mathematics, Imperial College London
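Pseudocode	Yes	Algorithm 1 Wasserstein Projection Gradient Descent (WPGD) 1: for i = 1 to num_of_iteration do 2: Compute $\hat{\Upsilon} := \big( \frac{1}{N} \sum_{n=1}^{N} \| \nabla_x l_{DH}(\theta; \hat{S}_n) \|^{q} \big)^{1/q}$ 3: for n = 1 to N do 4: $\hat{S}_n \leftarrow \hat{S}_n + \beta \, \mathrm{sign}(\nabla_x l_{DH}(\theta; \hat{S}_n)) \, \| \nabla_x l_{DH}(\theta; \hat{S}_n) \|^{q-1} \, \hat{\Upsilon}^{1-q}$ 5: end for 6: $\hat{S}_1, \dots, \hat{S}_n \leftarrow \mathrm{Proj}_{\hat{B}_\delta(\mu)}(\hat{S}_1, \dots, \hat{S}_n)$ 7: end for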
Open Source Code	Yes	The code is available on https://github.com/Guangyi-Mira/Distributional-Adversarial-Attacks-and-Training-in-Deep-Hedging
Open Datasets	Yes	For this evaluation, we constructed two synthetic datasets based on an additional model introduced in the appendix. The FIX dataset is simulated using the General Affine Diffusion (GAD) model with fixed parameters estimated from a 250-day period prior to 8 March 2020. The ROBUST dataset, following [11], is also based on the GAD model but incorporates parameter robustness by sampling parameters uniformly from intervals determined by the extreme values across 26 rolling estimates. ... Specifically, we train hedging strategies using historical daily closing prices from leading companies in the S&P 500 index from [47], covering the period from 26 September 2008 to 8 March 2020. [47] Ran Aroussi. yfinance: Yahoo! finance market data downloader. https://github.com/ranaroussi/yfinance, 2019.
Dataset Splits	Yes	For each model, we generate extensive training datasets of 100,000 sample paths. To examine robustness across varying dataset sizes, we partition each dataset into smaller subsets with sizes N ranging from 5,000 to 100,000 samples. Neural networks are independently trained on these subsets, and the average performance is assessed and reported on a fixed test set, which contains 1 million paths, generated from the same distribution. In addition, we generate a validation set of 100,000 paths, but only N paths will be used for validation, so that the training is exposed to only a limited number of data depending on N.
Hardware Specification	Yes	All computational runs are conducted without GPU on AMD EPYC 7742 or Intel Icelake Xeon Platinum 8358 processors equipped with less than 64GB of memory.
Software Dependencies	No	The paper mentions using the Adam optimizer and describes network architectures with batch normalization and ReLU activation, but it does not specify version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	Our training procedure begins with a preliminary phase of clean training to establish stable initial parameters. Specifically, this phase lasts 100 epochs for the BS model and 300 epochs for the more complex Heston model. Subsequently, the network undergoes adversarial training for an additional 200 epochs (BS) or 400 epochs (Heston), alternating adversarial example generation and optimization of Eq. (5.1). For comparison, we train baseline networks (clean strategies) exclusively with clean training for an equivalent total duration (300 epochs for BS, 700 epochs for Heston). Optimizer and Learning rate. Optimization utilizes the Adam optimizer, with decaying learning rate initially set to 0.005 for BS and 0.05 for Heston. The batch size is set to 10,000 unless the dataset size N is smaller, in which case the entire dataset is utilized per batch. Hyperparameters. Critical adversarial training hyperparameters include α, tested at 0, 1, 10 to gauge the relative influence of clean versus adversarial loss, and perturbation magnitude δ, explored across {0.001, 0.003, 0.005, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5}. Hyperparameter selection is performed by evaluating performance on a validation set of size N and selecting the hyperparameters yielding the best validation results. Adversarial attack. During the experiment, we employ the WBPGD algorithm detailed in Algorithm 2 for adversarial attacks. We execute this algorithm for 20 iterations, setting the step-size as β = 4/20δ, which is dependent on the perturbation magnitude δ.
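
The update quoted in the Pseudocode row (Algorithm 1, WPGD) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wpgd_attack`, the callable `grad_fn`, the per-path Euclidean norms, and the projection step are all hypothetical choices. In particular, the projection here simply rescales the perturbations so that the empirical cost (mean over n of ||Ŝ_n − S_n||^p)^(1/p) does not exceed δ, which is an exact Euclidean projection only for p = 2; the paper's ball B̂_δ(µ) may be defined differently.

```python
import numpy as np


def wpgd_attack(paths, grad_fn, delta, beta, num_iters=20, q=2.0, p=2.0):
    """Sketch of a Wasserstein projected gradient attack on sample paths.

    Assumptions (not taken from the source): per-path Euclidean norms, and a
    projection that uniformly rescales the perturbations so that
    (mean_n ||S_adv_n - S_n||^p)^(1/p) <= delta.
    """
    original = np.asarray(paths, dtype=float)
    adv = original.copy()
    for _ in range(num_iters):
        grads = grad_fn(adv)                         # shape (N, T): gradient per path
        norms = np.linalg.norm(grads, axis=1)        # per-path gradient norm
        upsilon = np.mean(norms ** q) ** (1.0 / q)   # step 2: aggregate gradient size
        if upsilon == 0.0:
            break                                    # no ascent direction left
        # Step 4: signed step, scaled per path by ||grad||^(q-1) * upsilon^(1-q).
        scale = norms ** (q - 1.0) * upsilon ** (1.0 - q)
        adv = adv + beta * np.sign(grads) * scale[:, None]
        # Step 6 (assumed form): shrink perturbations back into the delta-ball.
        diff = adv - original
        cost = np.mean(np.linalg.norm(diff, axis=1) ** p) ** (1.0 / p)
        if cost > delta:
            adv = original + diff * (delta / cost)
    return adv


# Toy usage: loss 0.5 * ||S||^2 per path, whose gradient is S itself.
rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 30))
adv = wpgd_attack(clean, grad_fn=lambda s: s, delta=0.1, beta=0.02)
```

In a real deep hedging setup, `grad_fn` would return the gradient of the hedging loss with respect to the input paths, typically obtained by automatic differentiation.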
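
The settings quoted in the Experiment Setup row can be collected into a small configuration sketch. All names below (`EPOCHS`, `step_size`, `batch_size`, `select_hyperparameters`, `validation_loss`) are hypothetical. Two points are assumptions rather than facts from the source: the step size "β = 4/20δ" is read as β = 4δ/20, and "best validation results" is taken to mean lowest validation loss.

```python
from itertools import product

# Epoch schedule per model: clean warm-up followed by adversarial training.
EPOCHS = {
    "BS": {"clean": 100, "adversarial": 200},
    "Heston": {"clean": 300, "adversarial": 400},
}
INITIAL_LR = {"BS": 0.005, "Heston": 0.05}

ALPHAS = [0, 1, 10]
DELTAS = [0.001, 0.003, 0.005, 0.01, 0.03, 0.05, 0.1, 0.3, 0.5]
ATTACK_ITERS = 20


def step_size(delta, iters=ATTACK_ITERS):
    # Assumed reading of the quoted step size: beta = 4 * delta / 20.
    return 4.0 * delta / iters


def batch_size(n, cap=10_000):
    # Use the whole dataset as one batch when it is smaller than the cap.
    return min(n, cap)


def select_hyperparameters(validation_loss):
    """Pick the (alpha, delta) pair with the lowest validation loss.

    `validation_loss` is a hypothetical callable (alpha, delta) -> float.
    """
    return min(product(ALPHAS, DELTAS), key=lambda pair: validation_loss(*pair))


# Toy usage with a made-up validation loss, for illustration only.
best = select_hyperparameters(lambda a, d: abs(a - 1) + abs(d - 0.03))
```

The grid has 3 × 9 = 27 combinations, and the clean plus adversarial epochs sum to the baseline durations quoted in the row (300 for BS, 700 for Heston).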