Symbolic Metamodels for Interpreting Black-Boxes Using Primitive Functions

Authors: Mahed Abroshan, Saumitra Mishra, Mohammad Mahdi Khalili

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using several experiments, we show that our method outperforms recent metamodeling approaches suggested for interpreting black-boxes. We evaluate and compare our proposed method using three experiments. In the first experiment, we use our method to approximate four functions with simple expressions. In the second experiment, we use our method for estimating instance-wise feature importance for three synthetic datasets. Finally, in the third experiment, we consider black-boxes trained on real data and approximate them using the metamodel.
Researcher Affiliation | Collaboration | Mahed Abroshan (1), Saumitra Mishra (2,*), Mohammad Mahdi Khalili (3,4). (1) The Alan Turing Institute, London, UK; (2) JP Morgan AI Research, London, UK; (3) Yahoo! Research, NYC, NY, USA; (4) CSE Department, The Ohio State University, Columbus, Ohio, USA
Pseudocode | Yes | A pseudo-code of the algorithm and a flowchart are presented in Appendix A.
Open Source Code | No | The paper does not explicitly state that the authors' code is open source or provide a link to it. It only mentions using a third-party library (`gplearn`).
Open Datasets | Yes | Three synthetic datasets are used: XOR, nonlinear additive features, and feature switching. First, we train a 2-layer neural network f(x) with 200 hidden neurons for estimating the label of each data point. Then, we run our algorithm to find a function g(x) that estimates the function f(x). We consider the coefficient of each feature in the Taylor series of g(x) as a metric for its importance: the larger the coefficient, the more important the feature. We rank the features based on their importance. We consider 1000 data points, repeat the process for each data point, and find the median feature-importance ranking. The median rank of the relevant features determines the accuracy of the algorithm; a smaller median rank implies better accuracy. We evaluate the performance of our model on interpreting a black-box trained on real data, replicating the second experiment of (Crabbe et al. 2020). A Multilayer Perceptron (MLP) and a Support Vector Machine (SVM) are trained as two black boxes using the UCI Yacht dataset (Dua and Graff 2017) (additional results are reported in Appendix E). The feature-importance evaluation loop is sketched after the table.
Dataset Splits | No | We randomly use 80% of the data points for training the black-box model as well as the SMPF model, and the remaining 20% are used to evaluate the performance of the model. The paper specifies a train/test split but does not explicitly mention a distinct validation split for hyperparameter tuning. A split sketch appears after the table.
Hardware Specification | No | For the last experiment, training SP for the MLP black-box takes 215 minutes, while training our algorithm takes 45 minutes (both performed on a personal computer). This does not specify any particular hardware components or their specifications.
Software Dependencies | No | We train the MLP and SVM models using the scikit-learn library (Buitinck et al. 2013) with the default parameters. We use the gplearn library for the implementation of SR. No specific version numbers are provided for these libraries. A training sketch appears after the table.
Experiment Setup | Yes | L_1 and L_2 are important hyperparameters, determining the number of middle nodes. For each of the L_i middle nodes, a random subset of the bottom-layer nodes is chosen to be connected to that node. At first instance, for all 1 ≤ u ≤ L_i and 1 ≤ v ≤ d, we connect h_u and x_v with probability p_0 > 0. Then, if there exists an x_v that is not connected to any of the middle nodes, we choose 1 ≤ u ≤ L_i uniformly at random and connect x_v and h_u, ensuring every x_v is connected to at least one of the middle nodes. p_0 is the parameter that controls the sparsity of the produced graphs, which is one of the main factors determining the complexity of the training procedure. Each edge represents a function from our class of primitive functions; thus, for each edge, we choose one of the function classes uniformly at random and initialize its parameters with samples from the normal distribution. In the training phase, for each tree, we update the parameters of each edge using gradient descent: we choose a constant k and apply k gradient-descent updates to the parameters of the functions g_{hi} and g_{ij}. In this work, we choose a fixed learning rate and leave the exploration of more advanced optimization techniques for future work. Regularization: we can modify the fitness criterion to favor simpler models. To encourage sparsity of the tree, we can add a term to the MSE error penalizing trees that have more edges. Denoting the total number of edges by E, we use this criterion for evaluating the fitness of the trees (λ is a hyperparameter): fitness of a given tree = MSE + λE. The graph construction and fitness criterion are sketched after the table. Some additional results and the hyperparameters are reported in Appendix E.
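
A minimal sketch of the instance-wise feature-importance evaluation from the Open Datasets row. Only the 2-layer / 200-hidden-neuron black box, the 1000 data points, and the median-rank metric come from the paper; the XOR-style data generator is an approximation, and `fit_metamodel` and `taylor_coefficients` are hypothetical placeholders for SMPF steps whose code is not released:

```python
# Hypothetical sketch: fit_metamodel and taylor_coefficients stand in for the
# unreleased SMPF fitting and Taylor-expansion steps described in the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # XOR-style label on two relevant features
f = MLPClassifier(hidden_layer_sizes=(200,), max_iter=500).fit(X, y)  # 2-layer net, 200 hidden neurons

ranks = []
for x in X:                                  # 1000 data points, per the paper
    g = fit_metamodel(f, x)                  # hypothetical: SMPF approximation of f near x
    c = taylor_coefficients(g, x)            # hypothetical: coefficients of g's Taylor series
    order = np.argsort(-np.abs(c))           # larger |coefficient| => more important
    # 1-based rank of the worst-ranked relevant feature (features 0 and 1)
    ranks.append(max(np.where(order == 0)[0][0], np.where(order == 1)[0][0]) + 1)
print(np.median(ranks))                      # smaller median rank = better accuracy
```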
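
The 80/20 split from the Dataset Splits row could be reproduced with scikit-learn's `train_test_split`; the paper names neither the splitting utility, the seed, nor the data files assumed here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical files holding the UCI Yacht features and targets.
X, y = np.load("yacht_X.npy"), np.load("yacht_y.npy")

# 80% trains both the black box and the SMPF metamodel; the remaining 20%
# evaluates. No separate validation split is described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # the seed is an assumption
```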
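
Continuing from the split above, a sketch of the black-box training with scikit-learn defaults and a gplearn symbolic regressor as the SR baseline. Fitting SR to the black box's predictions (rather than the raw labels) is our reading of the metamodeling setup, not an author-confirmed detail:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from gplearn.genetic import SymbolicRegressor

# Black boxes trained with scikit-learn default parameters, as in the paper.
mlp = MLPRegressor().fit(X_train, y_train)
svm = SVR().fit(X_train, y_train)

# SR baseline via gplearn; default hyperparameters here are illustrative
# (the paper defers its chosen hyperparameters to Appendix E).
sr = SymbolicRegressor(random_state=0)
sr.fit(X_train, mlp.predict(X_train))  # approximate the black box's outputs
```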
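
A minimal NumPy sketch of the random graph construction and the regularized fitness from the Experiment Setup row. The values L_i = 5, d = 10, p_0 = 0.3, and λ = 0.01 are illustrative, not the paper's hyperparameters; assigning a primitive-function class per edge and the k gradient-descent updates are omitted:

```python
import numpy as np

def random_bipartite_edges(L_i, d, p0, rng):
    """Connect middle nodes h_1..h_{L_i} to inputs x_1..x_d.

    Each pair (h_u, x_v) is connected with probability p0; any x_v left
    unconnected is then attached to a uniformly random middle node, so
    every input reaches at least one middle node. p0 controls sparsity.
    """
    adj = rng.random((L_i, d)) < p0
    for v in range(d):
        if not adj[:, v].any():
            adj[rng.integers(L_i), v] = True
    return adj

def fitness(mse, n_edges, lam):
    # Regularized fitness from the paper: MSE + lambda * E, penalizing
    # trees with more edges to encourage sparsity.
    return mse + lam * n_edges

rng = np.random.default_rng(0)
adj = random_bipartite_edges(L_i=5, d=10, p0=0.3, rng=rng)
print(fitness(mse=0.12, n_edges=int(adj.sum()), lam=0.01))
```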