Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Explainable Data Decompositions
Authors: Sebastian Dalleiger, Jilles Vreeken
Pages: 3709-3716
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation on synthetic and real-world data shows that DISC efficiently discovers meaningful components and accurately characterises these in easily understandable terms. |
| Researcher Affiliation | Academia | Sebastian Dalleiger, Jilles Vreeken CISPA Helmholtz Center for Information Security EMAIL |
| Pseudocode | Yes | Algorithm 1: DESC for Describing the Composition and Algorithm 2: DISC for Discovering the Composition |
| Open Source Code | Yes | We provide the source code, datasets, synthetic dataset generator, and additional information needed for reproducibility. [Footnote 1: https://eda.mmci.uni-saarland.de/disc/] |
| Open Datasets | Yes | We provide the source code, datasets, synthetic dataset generator, and additional information needed for reproducibility. [Footnote 1: https://eda.mmci.uni-saarland.de/disc/] |
| Dataset Splits | No | The paper evaluates on synthetic and real-world datasets but does not explicitly provide details about train/validation/test splits (e.g., percentages, sample counts, or specific split methodologies) for reproduction. |
| Hardware Specification | Yes | We implemented DISC in C++, ran experiments on a 12-Core Intel Xeon E5-2643 CPU, and report wall-clock time. |
| Software Dependencies | No | The paper states 'We implemented DISC in C++' but does not provide specific version numbers for key software components, libraries, or solvers. |
| Experiment Setup | Yes | In all experiments we have used the same significance level α = 0.01. and Since DBSCAN relies on hyper-parameter, we optimize ℓ using a grid-search over 7 ϵ-candidates and we do not constraint cluster-sizes. |