Interpretable Concept-Based Memory Reasoning
Authors: David Debot, Pietro Barbiero, Francesco Giannini, Gabriele Ciravegna, Michelangelo Diligenti, Giuseppe Marra
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that CMR achieves better accuracy-interpretability trade-offs than state-of-the-art CBMs, discovers logic rules consistent with ground truths, allows for rule interventions, and allows pre-deployment verification. Section 6 (Experiments): Our experiments aim to answer the following research questions. |
| Researcher Affiliation | Academia | David Debot (KU Leuven, david.debot@kuleuven.be); Pietro Barbiero (Università della Svizzera italiana & University of Cambridge, barbiero@tutanota.com); Francesco Giannini (Scuola Normale Superiore, francesco.giannini@sns.it); Gabriele Ciravegna (DAUIN, Politecnico di Torino, gabriele.ciravegna@polito.it); Michelangelo Diligenti (University of Siena, michelangelo.diligenti@unisi.it); Giuseppe Marra (KU Leuven, giuseppe.marra@kuleuven.be) |
| Pseudocode | No | The paper describes the model architecture and mathematical derivations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/daviddebot/CMR under the Apache License, Version 2.0. |
| Open Datasets | Yes | We base our experiments on datasets commonly used to evaluate CBMs: MNIST+ [22], where the task is to predict the sum of two digits; C-MNIST, where we adapted MNIST to the task of predicting whether a coloured digit is even or odd; a variant of MNIST+ where we removed the concepts for the digits 0 and 1 from the concept set; CelebA [23], a large-scale face attributes dataset with more than 200K celebrity images, each with 40 concept annotations; CUB [24], where the task is to predict bird species based on bird characteristics; and CEBaB [25], a text-based task where reviews are classified as positive or negative based on different criteria (e.g. food, ambience, service, etc.). All datasets we used are freely available on the web with licenses: MNIST (CC BY-SA 3.0 DEED), CEBaB (CC BY 4.0 DEED), CUB (MIT License), CelebA (available for non-commercial research purposes only). |
| Dataset Splits | Yes | In CelebA, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 1000, and we train for 100 epochs. In CEBaB, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 128, and we train for 100 epochs. In MNIST+ and its variant, we use a validation split of 9:1. |
| Hardware Specification | Yes | The experiments for MNIST+ (and its variant), C-MNIST, CelebA and CEBaB were run on a machine with an NVIDIA GeForce GTX 1080 Ti and an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with 128 GB RAM. The experiment for CUB and the fine-tuning of the BERT model used for the CEBaB embeddings were run on a machine with an i7-10750H CPU @ 2.60GHz × 12 and a GeForce RTX 2060 GPU with 16 GB RAM. |
| Software Dependencies | Yes | For our experiments, we implemented the models in Python 3.11.5 using open source libraries. This includes PyTorch v2.1.1 (BSD license) [48], PyTorch Lightning v2.1.2 (Apache license 2.0), scikit-learn v1.3.0 (BSD license) [49] and xgboost v2.0.3 (Apache license 2.0). We used CUDA v12.4. Plots were made using Matplotlib v3.8.0 (BSD license) [50]. |
| Experiment Setup | Yes | In CelebA, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 1000, and we train for 100 epochs. In CEBaB, we use a validation split of 8:2, a learning rate of 0.001, a batch size of 128, and we train for 100 epochs. In CUB, we use a learning rate of 0.001, a batch size of 1280, and we train for 300 epochs. In MNIST+ and its variant, we use a validation split of 9:1, a learning rate of 0.0001, a batch size of 512, and we train for 300 epochs. In C-MNIST, we also use a learning rate of 0.0001, a batch size of 512, and we train for 300 epochs. Specific hyperparameters are also given, e.g.: CMR uses a rule embedding size of 100, at most 5 rules per task, a β of 30, and we reset the selector every 35 epochs. |
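The per-dataset hyperparameters quoted in the table can be collected into a small configuration sketch. This is an illustrative summary only: the dictionary layout, field names, and `split_sizes` helper are our own and are not taken from the CMR repository.

```python
# Hypothetical summary of the hyperparameters reported above; field names
# are illustrative, not the CMR repository's actual configuration format.
CONFIGS = {
    "CelebA":  {"val_split": 0.2, "lr": 1e-3, "batch_size": 1000, "epochs": 100},
    "CEBaB":   {"val_split": 0.2, "lr": 1e-3, "batch_size": 128,  "epochs": 100},
    "CUB":     {"val_split": None, "lr": 1e-3, "batch_size": 1280, "epochs": 300},
    "MNIST+":  {"val_split": 0.1, "lr": 1e-4, "batch_size": 512,  "epochs": 300},
    "C-MNIST": {"val_split": None, "lr": 1e-4, "batch_size": 512,  "epochs": 300},
}

def split_sizes(n_examples: int, val_split: float) -> tuple[int, int]:
    """Return (train, validation) sizes for an n_examples-sized training set,
    holding out a val_split fraction (e.g. 0.2 for the quoted 8:2 split)."""
    n_val = int(n_examples * val_split)
    return n_examples - n_val, n_val

# e.g. an 8:2 split of 10,000 examples
print(split_sizes(10_000, CONFIGS["CelebA"]["val_split"]))  # → (8000, 2000)
```

A split like this is typically done with `sklearn.model_selection.train_test_split`, which matches the scikit-learn dependency listed above; the sketch only shows the arithmetic the quoted ratios imply.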