Interpretable Concept-Based Memory Reasoning
Authors: David Debot, Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CMR achieves better accuracy-interpretability trade-offs than state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification. Section 6 (Experiments): Our experiments aim to answer the following research questions. |
| Researcher Affiliation | Academia | David Debot (KU Leuven, david.debot@kuleuven.be); Pietro Barbiero (Università della Svizzera italiana & University of Cambridge, barbiero@tutanota.com); Francesco Giannini (Scuola Normale Superiore, francesco.giannini@sns.it); Gabriele Ciravegna (DAUIN, Politecnico di Torino, gabriele.ciravegna@polito.it); Michelangelo Diligenti (University of Siena, michelangelo.diligenti@unisi.it); Giuseppe Marra (KU Leuven, giuseppe.marra@kuleuven.be) |
| Pseudocode | No | The paper describes the model architecture and mathematical derivations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/daviddebot/CMR under the Apache License, Version 2.0. |
| Open Datasets | Yes | We base our experiments on datasets commonly used to evaluate CBMs: MNIST+ [22], where the task is to predict the sum of two digits; C-MNIST, where we adapted MNIST to the task of predicting whether a coloured digit is even or odd; a variant of MNIST+ where we removed the concepts for the digits 0 and 1 from the concept set; CelebA [23], a large-scale face attributes dataset with more than 200K celebrity images, each with 40 concept annotations; CUB [24], where the task is to predict bird species based on bird characteristics; and CEBaB [25], a text-based task where reviews are classified as positive or negative based on different criteria (e.g. food, ambience, service, etc.). All datasets we used are freely available on the web with licenses: MNIST (CC BY-SA 3.0 DEED), CEBaB (CC BY 4.0 DEED), CUB (MIT License), CelebA (available for non-commercial research purposes only). |
| Dataset Splits | Yes | In CelebA, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 1000, and we train for 100 epochs. In CEBaB, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 128, and we train for 100 epochs. In MNIST+ and its variant, we use a validation split of 9:1. |
| Hardware Specification | Yes | The experiments for MNIST+ (and its variant), C-MNIST, CelebA and CEBaB were run on a machine with an NVIDIA GeForce GTX 1080 Ti and an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 128 GB RAM. The experiment for CUB and the fine-tuning of the BERT model used for the CEBaB embeddings were run on a machine with an i7-10750H CPU @ 2.60GHz × 12 and a GeForce RTX 2060 GPU with 16 GB RAM. |
| Software Dependencies | Yes | For our experiments, we implemented the models in Python 3.11.5 using open source libraries. This includes PyTorch v2.1.1 (BSD license) [48], PyTorch Lightning v2.1.2 (Apache license 2.0), scikit-learn v1.3.0 (BSD license) [49] and xgboost v2.0.3 (Apache license 2.0). We used CUDA v12.4. Plots were made using Matplotlib v3.8.0 (BSD license) [50]. |
| Experiment Setup | Yes | In CelebA, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 1000, and we train for 100 epochs. In CEBaB, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 128, and we train for 100 epochs. In CUB, we use a learning rate of 0.001, a batch size of 1280, and we train for 300 epochs. In MNIST+ and its variant, we use a validation split of 9:1, a learning rate of 0.0001, a batch size of 512, and we train for 300 epochs. In C-MNIST, we also use a learning rate of 0.0001, a batch size of 512, and we train for 300 epochs. Specific hyperparameters are also given, e.g.: CMR uses a rule embedding size of 100, at most 5 rules per task, a β of 30, and we reset the selector every 35 epochs. |
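The per-dataset hyperparameters quoted in the table can be collected into a small configuration sketch. This is an illustrative summary only: the dictionary layout, field names, and `split_sizes` helper are our own and are not taken from the CMR repository.

```python
# Hypothetical summary of the hyperparameters reported above; field names
# are illustrative, not the CMR repository's actual configuration format.
CONFIGS = {
    "CelebA":  {"val_split": 0.2, "lr": 1e-3, "batch_size": 1000, "epochs": 100},
    "CEBaB":   {"val_split": 0.2, "lr": 1e-3, "batch_size": 128,  "epochs": 100},
    "CUB":     {"val_split": None, "lr": 1e-3, "batch_size": 1280, "epochs": 300},
    "MNIST+":  {"val_split": 0.1, "lr": 1e-4, "batch_size": 512,  "epochs": 300},
    "C-MNIST": {"val_split": None, "lr": 1e-4, "batch_size": 512,  "epochs": 300},
}

def split_sizes(n_examples: int, val_split: float) -> tuple[int, int]:
    """Return (train, validation) sizes for an n_examples-sized training set,
    holding out a val_split fraction (e.g. 0.2 for the quoted 8:2 split)."""
    n_val = int(n_examples * val_split)
    return n_examples - n_val, n_val

# e.g. an 8:2 split of 10,000 examples
print(split_sizes(10_000, CONFIGS["CelebA"]["val_split"]))  # → (8000, 2000)
```

A split like this is typically done with `sklearn.model_selection.train_test_split`, which matches the scikit-learn dependency listed above; the sketch only shows the arithmetic the quoted ratios imply.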