Learning to Receive Help: Intervention-Aware Concept Embedding Models

Authors: Mateo Espinosa Zarlenga, Katherine M. Collins, Krishnamurthy Dvijotham, Adrian Weller, Zohreh Shams, Mateja Jamnik

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that IntCEMs significantly outperform state-of-the-art concept-interpretable models when provided with test-time concept interventions, demonstrating the effectiveness of our approach. In this section, we evaluate IntCEMs by exploring the following research questions:
Researcher Affiliation | Collaboration | Mateo Espinosa Zarlenga, University of Cambridge, me466@cam.ac.uk; Katherine M. Collins, University of Cambridge, kmc61@cam.ac.uk; Krishnamurthy (Dj) Dvijotham, Google DeepMind, dvij@google.com; Adrian Weller, University of Cambridge & Alan Turing Institute, aw665@cam.ac.uk; Zohreh Shams, University of Cambridge, zs315@cam.ac.uk; Mateja Jamnik, University of Cambridge, mateja.jamnik@cl.cam.ac.uk
Pseudocode | No | The paper describes the architecture and training procedure in text and diagrams (e.g., Figure 2) but does not present formally structured pseudocode or algorithm blocks.
Open Source Code | Yes | All of our code, including configs and scripts to recreate results shown in this paper, has been released as part of CEM's official public repository found at https://github.com/mateoespinosa/cem.
Open Datasets | Yes | We consider five vision tasks: (1) MNIST-Add, a task inspired by the UMNIST dataset [29] where one is provided with 12 MNIST [30] images containing digits in {0, ..., 9}, as well as their digit labels as concept annotations... (3) CUB [31]... (5) CelebA, a task on the Celebrity Attributes Dataset [32]... [See the MNIST-Add sketch below the table.]
Dataset Splits | Yes | We monitor the validation loss by sampling 20% of the training set to make our validation set for each task. This validation set is also used to select the best-performing models whose results we report in Section 4. [See the split sketch below the table.]
Hardware Specification | Yes | All our non-ablation experiments were run in a shared GPU cluster with four Nvidia Titan Xp GPUs and 40 Intel(R) Xeon(R) E5-2630 v4 CPUs (at 2.20GHz) with 125GB of RAM. In contrast, all of our ablation experiments were run on a separate GPU cluster with 4x Nvidia A100-SXM-80GB GPUs, where each GPU is allocated 32 CPUs.
Software Dependencies | Yes | Our implementation of IntCEM is built on top of that repository using PyTorch 1.12 [40]... their code was written in TensorFlow, and as such, we converted it to PyTorch. All numerical plots and graphics have been generated using Matplotlib 3.5...
Experiment Setup | Yes | We always train with stochastic gradient descent with batch size B (selected to well-utilise our hardware), initial learning rate r_initial, and momentum 0.9. Following the hyperparameter selection reported in [5], we use r_initial = 0.01 for all tasks with the exception of the CelebA task where this is set to r_initial = 0.005. Finally, to help find better local minima in our loss function during training, we decrease the learning rate by a factor of 0.1 whenever the training loss plateaus for 10 epochs. [See the training-setup sketch below the table.]
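
To make the quoted MNIST-Add construction concrete, the sketch below assembles a single sample with 12 MNIST digit images and their one-hot digit labels as concept annotations. This is a minimal illustration, not the authors' data pipeline: the `make_mnist_add_sample` helper and the sum-threshold task label are assumptions, since the quote specifies the inputs and concepts but not how the downstream label is computed.

```python
# Minimal sketch of one MNIST-Add-style sample: 12 MNIST digit images with
# one-hot digit labels as concept annotations. The sum-threshold task label
# is a hypothetical placeholder; the quoted passage does not define it.
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

mnist = datasets.MNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)

def make_mnist_add_sample(n_operands: int = 12):
    # Draw n_operands digit images uniformly at random.
    idxs = torch.randint(len(mnist), (n_operands,)).tolist()
    images = torch.stack([mnist[i][0] for i in idxs])     # (12, 1, 28, 28)
    digits = torch.tensor([mnist[i][1] for i in idxs])    # (12,)
    # Concept annotations: one-hot digit labels per operand.
    concepts = F.one_hot(digits, num_classes=10).float()  # (12, 10)
    # Hypothetical binary label: digit sum at least half the maximum (12 * 9).
    label = int(digits.sum().item() >= (n_operands * 9) / 2)
    return images, concepts, label
```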
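The 80/20 validation split quoted under Dataset Splits maps naturally onto PyTorch's `random_split`, assuming each task is exposed as a standard `Dataset` object (consistent with the PyTorch stack noted under Software Dependencies). The `split_train_val` helper and the fixed seed are assumptions added only so the sketch is self-contained and reproducible.

```python
# Minimal sketch of the quoted 80/20 train/validation split for each task,
# assuming `train_set` is a standard PyTorch Dataset.
import torch
from torch.utils.data import random_split

def split_train_val(train_set, val_fraction=0.2, seed=42):
    val_size = int(len(train_set) * val_fraction)
    train_size = len(train_set) - val_size
    # The fixed seed is an assumption, added to make the sketch reproducible.
    generator = torch.Generator().manual_seed(seed)
    return random_split(train_set, [train_size, val_size], generator=generator)
```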
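Finally, the Experiment Setup row corresponds to standard PyTorch components: SGD with momentum 0.9 plus a plateau-based learning-rate scheduler. The sketch below is a hedged reading of that quote, not the released training script; the placeholder model and the per-epoch `scheduler.step(train_loss)` call are assumptions, while the momentum, initial learning rates, decay factor of 0.1, and 10-epoch patience come directly from the quote.

```python
# Minimal sketch of the quoted optimisation setup: SGD with momentum 0.9,
# r_initial = 0.01 (0.005 for CelebA), and a 0.1 learning-rate decay after
# the training loss plateaus for 10 epochs.
import torch

model = torch.nn.Linear(10, 2)  # placeholder; stands in for the IntCEM model
r_initial = 0.01                # 0.005 for the CelebA task, per the quote

optimizer = torch.optim.SGD(model.parameters(), lr=r_initial, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=10
)

# Assumed usage: once per epoch, after computing that epoch's training loss,
# call scheduler.step(train_loss) so `patience` counts plateau epochs.
```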