Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scalable Neural Network Geometric Robustness Validation via Hölder Optimisation

Authors: Yanghao Zhang, Panagiotis Kouvaros, Alessio Lomuscio

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments were conducted on a workstation equipped with a 16-core AMD Ryzen 9 9950X CPU, 192 GB of RAM, running Linux kernel 6.14.0-29-generic, and an NVIDIA RTX 5090 GPU with 32 GB of graphics memory. The implementation is in Python; the Hilbert space-filling curve mapping is implemented by using the hilbertcurve library [25]. The experimental evaluation is aimed to evaluate the practical applicability of the approach. We establish this by assessing the scalability of the approach on very large NNs and its reliability in practice.
Researcher Affiliation	Collaboration	Yanghao Zhang (1,2), Panagiotis Kouvaros (1), Alessio Lomuscio (1,2); (1) Safe Intelligence, UK; (2) Department of Computing, Imperial College London, UK
Pseudocode	Yes	The pseudocode of H2V is also outlined in Algorithm 1.
Open Source Code	No	The code will be open-sourced.
Open Datasets	Yes	To assess the scalability of the approach, we report the results obtained on large NNs ranging from Resnet34 to Resnet152 and vision transformers. These point to state-of-the-art scalability of the approach when validating the local robustness of large NNs against geometric perturbations on the ImageNet dataset. Beyond image tasks, we show that the method's scalability enables for the first time the robustness validation of large-scale 3D-NNs in video classification tasks against geometric perturbations for long-sequence input frames on Kinetics/UCF101 datasets. ... Indeed, no errors were found in the extensive evaluation reported. ... We also evaluate the correctness of the implementation empirically on SoundnessBench and additional benchmarks from VNN-COMP [4].
Dataset Splits	Yes	Table 1 reports the results obtained for 500 ImageNet samples. ... For the evaluation, we randomly selected 100 videos from the dataset and evaluated the robustness of the models against perturbations applied to entire video... We select the first two videos of each class in the test set, which consists of 202 videos in total.
Hardware Specification	Yes	Our experiments were conducted on a workstation equipped with a 16-core AMD Ryzen 9 9950X CPU, 192 GB of RAM, running Linux kernel 6.14.0-29-generic, and an NVIDIA RTX 5090 GPU with 32 GB of graphics memory.
Software Dependencies	No	The implementation is in Python; the Hilbert space-filling curve mapping is implemented by using the hilbertcurve library [25]. The experimental evaluation is aimed to evaluate the practical applicability of the approach. ... Specifically, we evaluated 5 pre-trained NNs from the open-source library PyTorchVideo [28] ... the timm (PyTorch Image Models) library.
Experiment Setup	Yes	The verification queries consisted of any combination of input transformation consisting of rotation, translation and isotropic scaling with parameters 20°, 10%, and 10%, respectively. We set the timeout budget for each verification query to 1200s (20 minutes) ... We set the timeout budget for each verification query to 3600s (60 minutes) and report the average runtime of H2V in seconds. ... The algorithm initiates the reliability parameter with r = 1.3, as recommended by [11]. When the algorithm converges (i.e., when it reaches the optimisation budget), the size of the neighbourhood is increased (nκ ← nκ + 1), and the value of the global Hölder constant is iteratively loosened (i.e., hk ← hk · 1.3), until a different interval is selected in Step 16 of the algorithm. This process is repeated until the optimisation budget is reached for the same interval 25 times
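
The Experiment Setup entry describes perturbations built from rotation, translation and isotropic scaling. As a rough illustration only, and not the authors' implementation, such a combination can be written as a single 2D affine matrix. The function names, the composition order (rotate and scale about the image centre, then translate), and the reading of the translation percentage as a fraction of the image width and height are all assumptions made for this sketch.

```python
import math

def affine_matrix(angle_deg, tx_frac, ty_frac, scale, width, height):
    """Hypothetical helper: 3x3 homogeneous matrix for one perturbation.

    Assumed order: rotate and scale about the image centre, then translate.
    tx_frac and ty_frac are fractions of the image width and height.
    """
    theta = math.radians(angle_deg)
    c = math.cos(theta) * scale
    s = math.sin(theta) * scale
    cx, cy = width / 2.0, height / 2.0
    tx, ty = tx_frac * width, ty_frac * height
    # p' = scale * R(theta) * (p - centre) + centre + t
    return [
        [c, -s, cx - c * cx + s * cy + tx],
        [s, c, cy - s * cx - c * cy + ty],
        [0.0, 0.0, 1.0],
    ]

def apply_affine(m, x, y):
    """Map a pixel coordinate (x, y) through the matrix m."""
    return (
        m[0][0] * x + m[0][1] * y + m[0][2],
        m[1][0] * x + m[1][1] * y + m[1][2],
    )

# One point of the perturbation space: 20 degrees, 10% shift in x, 10% upscaling.
m = affine_matrix(20.0, 0.10, 0.0, 1.10, width=224, height=224)
corner = apply_affine(m, 0.0, 0.0)
```

A robustness query in this setting would range over all parameter values inside the stated bounds rather than a single point; the sketch only shows how one parameter choice maps to pixel coordinates.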