Structure Your Data: Towards Semantic Graph Counterfactuals

Authors: Angeliki Dimitriou, Maria Lymperaiou, Georgios Filandrianos, Konstantinos Thomas, Giorgos Stamou

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We apply our method to benchmark and real-world datasets with varying difficulty and availability of semantic annotations. Testing on diverse classifiers, we find that our CEs outperform previous SotA explanation models based on semantics, including both white- and black-box as well as conceptual and pixel-level approaches. Their superiority is proven quantitatively and qualitatively, as validated by human subjects, highlighting the significance of leveraging semantic edges in the presence of intricate relationships. Our model-agnostic graph-based approach is widely applicable and easily extensible, producing actionable explanations across different contexts.
Researcher Affiliation | Academia | Angeliki Dimitriou, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Giorgos Stamou (Artificial Intelligence and Learning Systems Laboratory, National Technical University of Athens). Correspondence to: Angeliki Dimitriou <angelikidim@ails.ece.ntua.gr>.
Pseudocode | No | The paper provides mathematical equations and descriptions of the method, but it does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/aggeliki-dimitriou/SGCE.
Open Datasets | Yes | We experiment with Caltech-UCSD Birds (CUB) (Wah et al., 2011), despite its lack of ground-truth scene graphs. We employ Visual Genome (VG) (Krishna et al., 2017), a dataset containing over 108k human-annotated scene graphs... We provide CEs using the Smarty4covid dataset (Zarkogianni et al., 2023) for the IEEE COVID-19 sensor informatics competition winner. The analysis performed on Visual Genome (VG) is extended to the GQA dataset (Hudson & Manning, 2019). We test our method on a real-world image dataset extracted from Action Genome (Ji et al., 2020).
Dataset Splits | Yes | Evaluation comprises quantitative metrics as well as human-in-the-loop experiments. Quantitative results are extracted by comparing the ranks retrieved based on our obtained graph embeddings to the ground-truth ranks retrieved by GED. To minimize the computational burden, we use lightweight GNNs that accelerate the graph proximity process by mapping all N graphs to the same embedding space. By retrieving the closest embedding to G(A) that belongs to a class B ≠ A, GED is computed only once per query during retrieval. Concretely, we approximate the following optimization problem for semantic graphs extracted from any input modality: min_{G(B)} GED(G(A), G(B)), such that A ≠ B (1). The graph structure of the data imposes the requirement of defining an absolute similarity metric between graph pairs for the training stage. GED is regarded as the optimal choice despite its computational complexity; computing GED for only N/2 pairs to construct the training set is adequate for achieving high-quality representations, as validated experimentally. The experimental workflow is adopted from (Vandenhende et al., 2022); we therefore include all three stages (pre-learning, learning, and testing). (A minimal retrieval sketch is included after this table.)
Hardware Specification | Yes | We produce graph representations using a single Tesla K80 GPU, while all other computations are done on a 12-core Intel Core i7-5930K CPU.
Software Dependencies | No | The paper mentions software like PyG (Fey & Lenssen, 2019) and DGL (Wang et al., 2019), but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | All presented results were achieved using single-layer GNNs with a dimension of 2048, built as explained in Sec. 3 of the main paper. For reproducibility purposes, we report that these models were optimized with a batch size of 32 and trained for 50 epochs, without the use of dropout. The employed optimizer was Adam without weight decay. The learning rate varied among GNN variants: 0.04 for GCN and 0.02 for GAT and GIN. GAT and GIN also have model-specific hyperparameters, attention heads and the learnable parameter epsilon, respectively. Best results were achieved by using 8 attention heads and setting epsilon to non-learnable. (A configuration sketch follows this table.)
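
The Dataset Splits row describes retrieving, for a query graph G(A), the nearest graph embedding belonging to a different class, so that GED is computed only once per query. Below is a minimal sketch of that retrieval step. It assumes embeddings have already been produced by a trained GNN encoder; the function name retrieve_counterfactual, the Euclidean distance choice, and the random stand-in tensors are illustrative assumptions, not the paper's actual code.

    # Sketch: nearest-neighbor counterfactual retrieval over precomputed
    # GNN graph embeddings, restricted to graphs of a different class.
    import torch

    def retrieve_counterfactual(query_idx, embeddings, labels):
        """Index of the nearest graph whose class differs from the query's.

        embeddings: (N, d) tensor, one embedding per graph, computed once.
        labels:     (N,) tensor of class labels.
        """
        query = embeddings[query_idx]                                   # embedding of G(A)
        dists = torch.cdist(query.unsqueeze(0), embeddings).squeeze(0)  # (N,) Euclidean distances
        dists[labels == labels[query_idx]] = float("inf")               # enforce B != A
        return int(torch.argmin(dists))                                 # candidate G(B)

    # Hypothetical usage: GED is then computed once, between the query graph
    # and the retrieved candidate, rather than against all N graphs.
    emb = torch.randn(100, 2048)       # stand-in for precomputed embeddings
    lab = torch.randint(0, 5, (100,))  # stand-in for class labels
    cf_idx = retrieve_counterfactual(0, emb, lab)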
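
To make the Experiment Setup row concrete, here is a sketch of the reported configurations using PyG-style layers. The input feature size in_dim and the way the 8 GAT heads split the 2048-dimensional output (heads concatenated) are assumptions; the dimensions, learning rates, head count, and non-learnable epsilon come from the quoted text.

    # Sketch of the reported single-layer GNN variants (PyG layers assumed).
    import torch
    from torch_geometric.nn import GCNConv, GATConv, GINConv

    in_dim, hidden = 300, 2048  # in_dim is a placeholder for the node-feature size

    gcn = GCNConv(in_dim, hidden)                                    # trained with lr 0.04
    gat = GATConv(in_dim, hidden // 8, heads=8)                      # 8 heads concatenate to 2048; lr 0.02
    gin = GINConv(torch.nn.Linear(in_dim, hidden), train_eps=False)  # epsilon fixed (non-learnable); lr 0.02

    # Adam without weight decay, as reported; batch size 32, 50 epochs, no dropout.
    opt = torch.optim.Adam(gcn.parameters(), lr=0.04)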