Specifying and Testing k-Safety Properties for Machine-Learning Models

Authors: Maria Christakis, Hasan Ferit Eniser, Jörg Hoffmann, Adish Singla, Valentin Wüstholz

IJCAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effectiveness of our testing framework in detecting property violations across a broad range of different domains. We also perform a feasibility study to showcase how such violations can be used to improve model training. ... We trained models using seven datasets from five application domains as follows: ... For each of these models, we wrote one or more NOMOS specifications to capture potentially desired properties, for a total of 32 properties (see Appx. A for a complete list). Each test harness used a budget of 5000 (see line 1 of Fig. 4), that is, it generated 5000 test cases satisfying the precondition, if any. We ran each harness with 10 different random seeds to account for randomness in the testing procedure. Here, we report arithmetic means (e.g., for the number of bugs) unless stated otherwise. ... Tab. 1 gives an overview of the number of specified properties, violated properties, and unique bugs per dataset and model. Our framework was able to find violations for all datasets and, in particular, for 26 of these properties (see Tab. 2). Most violations were exhibited through tens or hundreds of unique tests. This demonstrates that our framework is effective in detecting bugs even with as few as 5000 tests per property. (A hedged sketch of this budget-and-seeds harness protocol appears after the table.)
Researcher Affiliation | Collaboration | Maria Christakis1, Hasan Ferit Eniser2, Jörg Hoffmann3,4, Adish Singla2 and Valentin Wüstholz5; 1TU Wien, Austria; 2MPI-SWS, Germany; 3Saarland University, Saarland Informatics Campus, Germany; 4German Research Center for Artificial Intelligence (DFKI), Germany; 5ConsenSys, Austria
Pseudocode | Yes | Figure 1: Example k-safety specifications in NOMOS. (a) COMPAS 2-safety property. ... Figure 4: Snippet of generated harness for the specification of Fig. 1a. (An illustrative 2-safety sketch appears after the table.)
Open Source Code | Yes | We design and implement a fully automated, publicly available framework for validating such properties using metamorphic testing. (Footnote 1: https://github.com/Rigorous-Software-Engineering/nomos)
Open Datasets | Yes | Tabular data. We used the COMPAS [Larson et al., 2016] and German Credit [Hofmann, 1994] datasets; ... Images. Using the MNIST dataset [LeCun et al., 1999], ... Speech. Similarly, for the Speech Command dataset [Warden, 2018], ... Natural language. The Hotel Review dataset [Liu, 2017] ... Action policies. Lunar Lander is a popular Gym [Brockman et al., 2016] environment ... We also used Bipedal Walker, another popular Gym [Brockman et al., 2016] environment
Dataset Splits | No | The paper mentions using a 'test set' for input sources and reports 'test accuracy' for the trained models, but it does not explicitly specify the proportions (e.g., percentages or sample counts) of the training, validation, and test splits used in its experiments, nor does it cite a source defining these splits.
Hardware Specification | Yes | We used a machine with a Quadro RTX 8000 GPU and an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz for training models and running tests.
Software Dependencies | No | The paper mentions 'ANTLR4' and the 'SB3 library' but does not provide specific version numbers for these software components. For example, it states 'PPO [Schulman et al., 2017] implemented in the SB3 library [Raffin et al., 2019]' but lacks a version for SB3.
Experiment Setup | Yes | Each test harness used a budget of 5000 (see line 1 of Fig. 4), that is, it generated 5000 test cases satisfying the precondition, if any. We ran each harness with 10 different random seeds to account for randomness in the testing procedure. ... In particular, we adjusted the existing training algorithm (PPO [Schulman et al., 2017] implemented in the SB3 library [Raffin et al., 2019]) to start episodes not only from random initial states, but also from buggy states. ... our guided-training algorithm tests the current policy at regular intervals (every 5 rollouts in our experiments)... (A hedged sketch of this guided-training loop appears after the table.)
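
The Research Type and Experiment Setup rows describe the testing protocol: each generated harness runs until 5000 precondition-satisfying test cases have been executed, and every harness is repeated with 10 random seeds, with arithmetic means reported. The following is a minimal sketch of that protocol in Python; `sample_input`, `precondition`, and `check_property` are hypothetical stand-ins for the pieces a generated NOMOS harness would provide, not the paper's actual code.

```python
# Hedged sketch of the testing protocol quoted above: a harness draws inputs
# until `budget` test cases satisfying the precondition have been executed,
# and the whole run is repeated for several random seeds.
import random

def run_harness(model, sample_input, precondition, check_property,
                budget=5000, seed=0):
    rng = random.Random(seed)
    violations = 0
    executed = 0
    while executed < budget:          # a real harness would also bound the attempts
        x = sample_input(rng)
        if not precondition(x):
            continue                  # only precondition-satisfying tests count
        executed += 1
        if not check_property(model, x):   # postcondition over k model executions
            violations += 1
    return violations

def run_experiment(model, sample_input, precondition, check_property,
                   seeds=range(10)):
    counts = [run_harness(model, sample_input, precondition, check_property, seed=s)
              for s in seeds]
    return sum(counts) / len(counts)  # arithmetic mean, as reported in the paper
```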
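The Pseudocode row refers to Fig. 1a of the paper, a 2-safety specification for COMPAS written in NOMOS. The exact property is defined only in the paper; purely as an illustration of what a k = 2 relation over two model executions can look like, here is a plausible rendering in plain Python. The feature name "race", its values, and the equality postcondition are assumptions of this sketch.

```python
# Illustrative 2-safety (k = 2) relation over two model executions, in the
# spirit of a fairness property for COMPAS. This is NOT the NOMOS
# specification from Fig. 1a; feature names and the postcondition are
# assumptions made for this sketch only.
def flip_protected_attribute(x):
    # Hypothetical perturbation: change only the protected attribute.
    y = dict(x)
    y["race"] = "Caucasian" if x["race"] == "African-American" else "African-American"
    return y

def two_safety_holds(model, x):
    x1, x2 = x, flip_protected_attribute(x)       # inputs differ only in "race"
    return model.predict(x1) == model.predict(x2) # relate the two outputs
```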
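The Experiment Setup row describes the feasibility study: PPO training (SB3) is adjusted so that episodes can also start from previously found buggy states, and the current policy is tested every 5 rollouts. The sketch below shows one way this could be wired up with a Gym-style environment wrapper and an SB3 callback; the `set_state` hook and the `find_buggy_states` tester are hypothetical, a gymnasium-style reset API is assumed, and this is not the paper's implementation.

```python
# Hedged sketch of the guided-training idea: restart some episodes from a
# shared buffer of buggy states and refill that buffer by testing the
# current policy every `every` rollouts.
import random
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback

class BuggyStateResetWrapper(gym.Wrapper):
    def __init__(self, env, buggy_states, p_buggy=0.5):
        super().__init__(env)
        self.buggy_states = buggy_states   # shared list, refilled by the callback
        self.p_buggy = p_buggy

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        if self.buggy_states and random.random() < self.p_buggy:
            state = random.choice(self.buggy_states)
            obs = self.env.unwrapped.set_state(state)  # hypothetical hook
        return obs, info

class TestEveryNRollouts(BaseCallback):
    def __init__(self, buggy_states, find_buggy_states, every=5):
        super().__init__()
        self.buggy_states = buggy_states
        self.find_buggy_states = find_buggy_states  # e.g. a run of a NOMOS harness
        self.every = every
        self.rollouts = 0

    def _on_rollout_end(self):
        self.rollouts += 1
        if self.rollouts % self.every == 0:
            self.buggy_states.extend(self.find_buggy_states(self.model))

    def _on_step(self):
        return True

# Hypothetical usage:
# buggy = []
# env = BuggyStateResetWrapper(gym.make("LunarLander-v2"), buggy)
# PPO("MlpPolicy", env).learn(200_000, callback=TestEveryNRollouts(buggy, my_tester))
```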