Relational Concept Bottleneck Models
Authors: Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we analyze the following research questions. Generalization: Can standard/relational CBMs generalize well in relational tasks? Can standard/relational CBMs generalize in out-of-distribution settings? Interpretability: Can relational CBMs provide meaningful explanations for their predictions? Are concept/rule interventions effective in relational CBMs? Efficiency: Can relational CBMs generalize in low-data regimes? Can relational CBMs correctly predict concept/task labels with scarce concept train labels? Data & task setup. We investigate our research questions using 7 relational datasets on image classification, link prediction and node classification. We introduce two simple but not trivial relational benchmarks, namely the Tower of Hanoi and Rock-Paper-Scissors (RPS), to demonstrate that standard CBMs cannot even solve very simple relational problems. The Tower of Hanoi is composed of 1000 images of disks positioned at different heights of a tower. Concepts include whether disk i is larger than disk j (or vice versa) and whether disk i is directly on top of disk j (or vice versa). The task is to predict for each disk whether it is well-positioned or not. The RPS dataset is composed of 200 images showing the characteristic hand-signs. Concepts indicate the object played by each player and the task is to predict whether a player wins, loses, or draws. We also evaluate our methods on real-world benchmark datasets specifically designed for relational learning: Cora, Citeseer [30], PubMed [23] and Countries on two increasingly difficult splits [28]. Additional details can be found in App. A.1 and App. A.5. Models. We compare R-CBMs against state-of-the-art concept bottleneck architectures, including CBMs with linear and non-linear task predictors (CBM-Linear and CBM-Deep) [13], a flat version (Flat-CBM) where each prediction is computed as a function of the full set of ground atoms, as well as Feedforward and Relational black-box architectures. We also compared against DeepStochLog [33], a state-of-the-art NeSy system, and other KGE-specific models for the studied KGE tasks. Our relational models include an R-CBM with DCR predictor (R-DCR) and its direct variant, using only 5 supervised examples per predicate (R-DCR-Low). We also considered a non-interpretable R-CBM version where the predictions are based on an unrestricted predictor processing the atom representations (R-CBM-Emb). In the experiments, the loss function was selected to be the standard cross-entropy loss. Further details are in App. A.2. Evaluation. We measure generalization using standard metrics, i.e., Area Under the ROC curve [9] for multi-class classification, accuracy for binary classification, Mean Reciprocal Rank (MRR) for link prediction, and MRR and Hits@N for KGE tasks (a minimal sketch of these metrics follows the table). We use these metrics to measure generalization across all experiments, including out-of-distribution scenarios, low-data regimes, and interventions. We report additional experiments and further details in App. A.3. |
| Researcher Affiliation | Academia | Pietro Barbiero (Università della Svizzera Italiana; University of Cambridge) barbiero@tutanota.com; Francesco Giannini (Scuola Normale Superiore) francesco.giannini@sns.it; Gabriele Ciravegna (Politecnico di Torino) gabriele.ciravegna@polito.it; Michelangelo Diligenti (University of Siena) michelangelo.diligenti@unisi.it; Giuseppe Marra (KU Leuven) giuseppe.marra@kuleuven.be |
| Pseudocode | No | The paper describes the model architecture and pipeline conceptually and mathematically, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to replicate the experiments presented in this paper is available at https://github.com/diligmic/RCBM-Neurips2024. We will release all of the code required to recreate our experiments in an MIT-licensed public repository. |
| Open Datasets | Yes | We build the Rock-Paper-Scissors (RPS) dataset by downloading images from Kaggle: https://www.kaggle.com/datasets/drgfreeman/rockpaperscissors?resource=download. We exploit the standard splits of the Planetoid Cora, Citeseer and PubMed citation networks, as defined in PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html. The Countries dataset (ODbL licence) defines a set of countries, regions and sub-regions as basic entities. We used the splits and setup from Rocktäschel et al. [28], which reports the basic statistics of the dataset and also defines the tasks S1, S2 used in this paper. |
| Dataset Splits | Yes | In all synthetic tasks, we generate datasets with 3,000 samples and use a traditional 70%-10%-20% random split for training, validation, and testing datasets, respectively. (A sketch of this split appears after the table.) |
| Hardware Specification | Yes | All of our experiments were run on a private machine with 8 Intel(R) Xeon(R) Gold 5218 CPUs (2.30GHz), 64GB of RAM, and 2 Quadro RTX 8000 Nvidia GPUs. |
| Software Dependencies | Yes | For our experiments, we implemented all baselines and methods in Python 3.7 and relied upon open-source libraries such as PyTorch 1.11 [24] (BSD license) and Scikit-learn [25] (BSD license). |
| Experiment Setup | Yes | During training, we set the weight of the concept loss to λ = 0.1 across all models. We then train all models for 3000 epochs using full batching and a default Adam [12] optimizer with learning rate 10^-4. (A sketch of this training loop follows the table.) |
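
The Evaluation row above lists MRR and Hits@N for the KGE tasks. Below is a minimal sketch of these metrics, not the authors' implementation; the function names and the convention that rank 1 is best are assumptions for illustration:

```python
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Hits@N: fraction of queries whose true entity ranks in the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Illustrative ranks assigned to the correct entity for four test triples.
ranks = [1, 3, 2, 10]
print(mrr(ranks))           # 0.4833...
print(hits_at_n(ranks, 3))  # 0.75
```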
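
The Dataset Splits row reports a 70%-10%-20% random split over 3,000 samples. A minimal sketch of such a split follows; the seed and variable names are illustrative assumptions, not taken from the released code:

```python
import torch

n = 3000                                       # synthetic datasets have 3,000 samples
perm = torch.randperm(n, generator=torch.Generator().manual_seed(0))
train_idx = perm[: int(0.7 * n)]               # 70% training
val_idx = perm[int(0.7 * n) : int(0.8 * n)]    # 10% validation
test_idx = perm[int(0.8 * n) :]                # 20% testing
```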
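
The Experiment Setup row specifies full-batch training for 3000 epochs with a default Adam optimizer at learning rate 10^-4 and a concept loss weighted by λ = 0.1. The sketch below wires those reported settings into a toy concept-bottleneck training loop; the network sizes, the sigmoid bottleneck, and the use of binary cross-entropy for the concept loss are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(100, 8)                        # toy inputs
c = torch.randint(0, 2, (100, 4)).float()      # toy binary concept labels
y = torch.randint(0, 3, (100,))                # toy task labels (3 classes)

concept_net = torch.nn.Linear(8, 4)            # concept predictor
task_net = torch.nn.Linear(4, 3)               # task predictor over concepts

lam = 0.1                                      # concept loss weight from the paper
params = list(concept_net.parameters()) + list(task_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)  # default Adam, lr = 10^-4

for epoch in range(3000):                      # full batching: one step per epoch
    optimizer.zero_grad()
    c_logits = concept_net(x)                  # predict concepts from inputs
    y_logits = task_net(torch.sigmoid(c_logits))  # predict task from concepts
    task_loss = F.cross_entropy(y_logits, y)
    concept_loss = F.binary_cross_entropy_with_logits(c_logits, c)
    loss = task_loss + lam * concept_loss
    loss.backward()
    optimizer.step()
```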