Relational Concept Bottleneck Models

Authors: Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we analyze the following research questions. Generalization: Can standard/relational CBMs generalize well in relational tasks? Can standard/relational CBMs generalize in out-of-distribution settings? Interpretability: Can relational CBMs provide meaningful explanations for their predictions? Are concept/rule interventions effective in relational CBMs? Efficiency: Can relational CBMs generalize in low-data regimes? Can relational CBMs correctly predict concept/task labels with scarce concept train labels?

Data & task setup. We investigate our research questions using 7 relational datasets on image classification, link prediction, and node classification. We introduce two simple but non-trivial relational benchmarks, namely the Tower of Hanoi and Rock-Paper-Scissors (RPS), to demonstrate that standard CBMs cannot solve even very simple relational problems. The Tower of Hanoi is composed of 1000 images of disks positioned at different heights of a tower. Concepts include whether disk i is larger than disk j (or vice versa) and whether disk i is directly on top of disk j (or vice versa). The task is to predict for each disk whether it is well-positioned or not. The RPS dataset is composed of 200 images showing the characteristic hand signs. Concepts indicate the object played by each player, and the task is to predict whether a player wins, loses, or draws. We also evaluate our methods on real-world benchmark datasets specifically designed for relational learning: Cora, Citeseer [30], PubMed [23], and Countries on two increasingly difficult splits [28]. Additional details can be found in App. A.1 and App. A.5.

Models. We compare R-CBMs against state-of-the-art concept bottleneck architectures, including CBMs with linear and non-linear task predictors (CBM-Linear and CBM-Deep) [13] and a flat version (Flat-CBM) where each prediction is computed as a function of the full set of ground atoms, as well as Feedforward and Relational black-box architectures. We also compare against DeepStochLog [33], a state-of-the-art NeSy system, and other KGE-specific models for the studied KGE tasks. Our relational models include an R-CBM with a DCR predictor (R-DCR) and its variant using only 5 supervised examples per predicate (R-DCR-Low). We also consider a non-interpretable R-CBM version where predictions are based on an unrestricted predictor processing the atom representations (R-CBM-Emb). In the experiments, the loss function was the standard cross-entropy loss. Further details are in App. A.2.

Evaluation. We measure generalization using standard metrics: Area Under the ROC curve [9] for multi-class classification, accuracy for binary classification, Mean Reciprocal Rank (MRR) for link prediction, and MRR and Hits@N for KGE tasks. We use these metrics to measure generalization across all experiments, including out-of-distribution scenarios, low-data regimes, and interventions. (A sketch of the MRR and Hits@N computation is given after the table below.) We report additional experiments and further details in App. A.3.
Researcher Affiliation | Academia | Pietro Barbiero (Università della Svizzera Italiana; University of Cambridge), barbiero@tutanota.com; Francesco Giannini (Scuola Normale Superiore), francesco.giannini@sns.it; Gabriele Ciravegna (Politecnico di Torino), gabriele.ciravegna@polito.it; Michelangelo Diligenti (University of Siena), michelangelo.diligenti@unisi.it; Giuseppe Marra (KU Leuven), giuseppe.marra@kuleuven.be
Pseudocode | No | The paper describes the model architecture and pipeline conceptually and mathematically, but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code to replicate the experiments presented in this paper is available at https://github.com/diligmic/RCBM-Neurips2024. We will release all of the code required to recreate our experiments in an MIT-licensed public repository.
Open Datasets | Yes | We build the Rock-Paper-Scissors (RPS) dataset by downloading images from Kaggle: https://www.kaggle.com/datasets/drgfreeman/rockpaperscissors?resource=download. We exploit the standard splits of the Planetoid Cora, Citeseer and PubMed citation networks, as defined in PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html. (A loading sketch using this library is given after the table below.) The Countries dataset (ODbL license) defines a set of countries, regions and sub-regions as basic entities. We used the splits and setup from Rocktäschel et al. [28], which reports the basic statistics of the dataset and also defines the tasks S1 and S2 used in this paper.
Dataset Splits | Yes | In all synthetic tasks, we generate datasets with 3,000 samples and use a traditional 70%-10%-20% random split for training, validation, and testing datasets, respectively. (See the splitting sketch after the table below.)
Hardware Specification | Yes | All of our experiments were run on a private machine with 8 Intel(R) Xeon(R) Gold 5218 CPUs (2.30GHz), 64GB of RAM, and 2 Quadro RTX 8000 Nvidia GPUs.
Software Dependencies | Yes | For our experiments, we implemented all baselines and methods in Python 3.7 and relied upon open-source libraries such as PyTorch 1.11 [24] (BSD license) and Scikit-learn [25] (BSD license).
Experiment Setup | Yes | During training, we set the weight of the concept loss to λ = 0.1 across all models. We then train all models for 3000 epochs using full batching and a default Adam [12] optimizer with learning rate 10⁻⁴.
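
To make the ranking metrics cited in the Research Type row concrete, here is a minimal sketch of MRR and Hits@N for link prediction. The tensor shapes, candidate-scoring setup, and function name are illustrative assumptions, not the authors' implementation.

```python
import torch

def mrr_and_hits(scores: torch.Tensor, true_idx: torch.Tensor, n: int = 10):
    """Compute Mean Reciprocal Rank and Hits@N.

    scores:   (num_queries, num_candidates) score for each candidate entity.
    true_idx: (num_queries,) index of the correct entity for each query.
    """
    # Rank of the true entity: 1 + number of candidates scored strictly higher.
    true_scores = scores.gather(1, true_idx.unsqueeze(1))   # (num_queries, 1)
    ranks = 1 + (scores > true_scores).sum(dim=1).float()   # (num_queries,)
    mrr = (1.0 / ranks).mean().item()
    hits_at_n = (ranks <= n).float().mean().item()
    return mrr, hits_at_n

# Example with random scores for 5 queries over 100 candidate entities.
scores = torch.randn(5, 100)
targets = torch.randint(0, 100, (5,))
print(mrr_and_hits(scores, targets, n=10))
```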
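The Planetoid citation networks and their standard splits referenced in the Open Datasets row can be loaded directly from PyTorch Geometric; the root directory below is an arbitrary placeholder, and the data is downloaded on first use.

```python
from torch_geometric.datasets import Planetoid

# Standard Planetoid citation networks with their predefined public splits.
for name in ["Cora", "CiteSeer", "PubMed"]:
    dataset = Planetoid(root="data/planetoid", name=name)
    data = dataset[0]
    print(name,
          data.num_nodes, data.num_edges,
          data.train_mask.sum().item(),   # nodes in the standard training split
          data.test_mask.sum().item())    # nodes in the standard test split
```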
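A minimal sketch of the 70%-10%-20% random split described in the Dataset Splits row; the helper name and seed are hypothetical, and the authors' exact splitting code may differ.

```python
import torch

def random_split_indices(num_samples: int, seed: int = 0):
    """Shuffle indices and split them 70%/10%/20% into train/val/test."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_samples, generator=g)
    n_train = int(0.7 * num_samples)
    n_val = int(0.1 * num_samples)
    return (perm[:n_train],
            perm[n_train:n_train + n_val],
            perm[n_train + n_val:])

train_idx, val_idx, test_idx = random_split_indices(3000)
print(len(train_idx), len(val_idx), len(test_idx))  # 2100 300 600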
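Finally, a hedged sketch of a training loop matching the Experiment Setup row (λ = 0.1 on the concept loss, 3000 epochs, full batching, Adam with learning rate 10⁻⁴). The ToyCBM model and random data below are placeholders for illustration only, not the paper's R-CBM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LAMBDA, EPOCHS, LR = 0.1, 3000, 1e-4

class ToyCBM(nn.Module):
    """Toy concept bottleneck: inputs -> concept logits -> task logits."""
    def __init__(self, in_dim=16, n_concepts=8, n_classes=3):
        super().__init__()
        self.concept_head = nn.Linear(in_dim, n_concepts)
        self.task_head = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        c = self.concept_head(x)
        return c, self.task_head(torch.sigmoid(c))

# Random placeholder data standing in for a real dataset.
x = torch.randn(128, 16)
y_concepts = torch.randint(0, 2, (128, 8)).float()
y_task = torch.randint(0, 3, (128,))

model = ToyCBM()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

for epoch in range(EPOCHS):
    optimizer.zero_grad()
    concept_logits, task_logits = model(x)  # full batching: one batch per epoch
    loss = F.cross_entropy(task_logits, y_task) \
         + LAMBDA * F.binary_cross_entropy_with_logits(concept_logits, y_concepts)
    loss.backward()
    optimizer.step()
```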