Efficiently Factorizing Boolean Matrices using Proximal Gradient Descent

Authors: Sebastian Dalleiger, Jilles Vreeken

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Through an extensive set of experiments, we demonstrate that our method works well in practice: On synthetic data, we show that it converges quickly, recovers the ground truth precisely, and estimates the simulated rank exactly. On real-world data, we improve upon the state of the art in recall, loss, and runtime, and a case study from the medical domain confirms that our results are easily interpretable and semantically meaningful."
Researcher Affiliation | Academia | Sebastian Dalleiger, CISPA Helmholtz Center for Information Security, sebastian.dalleiger@cispa.de; Jilles Vreeken, CISPA Helmholtz Center for Information Security, jv@cispa.de
Pseudocode | Yes | "Figure 1: On the left, we show the three regularizers: Bowl, PRIMP, and ELB, for λ = 0.5, and see that only our ELB regularizer penalizes non-Boolean values well. On the right, we show our method ELBMF as pseudocode. Algorithm 1: ELBMF"
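The exact ELB regularizer is defined in the paper; as a hedged illustration of what "penalizing non-Boolean values" can mean, the sketch below scores each matrix entry by its distance to the nearest Boolean value, combined elastic-net style. The weights `l1`, `l2` and this functional form are assumptions for illustration, not the paper's formula.

```python
import numpy as np

def elb_penalty(x, l1=0.5, l2=0.5):
    """Illustrative Boolean-ness penalty (assumed form, not the paper's ELB):
    distance of each entry to its nearest value in {0, 1}, combined l1/l2."""
    d = np.minimum(np.abs(x), np.abs(x - 1))  # distance to nearest of {0, 1}
    return l1 * d + l2 * d**2

# Exactly-Boolean entries incur zero penalty; entries far from {0, 1} are
# penalized most, which is the behavior Figure 1 attributes to ELB.
print(elb_penalty(np.array([0.0, 1.0, 0.5])))
```

Note that such a penalty is flat at 0 and 1, so a proximal step under it pulls factor entries toward Boolean values without forbidding intermediate values early in the optimization.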
Open Source Code | Yes | "We provide the source code, datasets, synthetic dataset generator, and other information needed for reproducibility." (Appendix C; DOI: 10.5281/zenodo.7187021)
Open Datasets | Yes | "We use nine publicly available datasets from different domains. To cover the biomedical domain, we extract the network containing empirical evidence of protein-protein interactions in Homo sapiens from the STRING database. From the GRAND repository, we take the gene regulatory networks sampled from Glioblastoma (GBM) and Lower Grade Glioma (LGG) brain cancer tissues, as well as from non-cancerous Cerebellum tissue. The TCGA dataset contains binarized gene expressions from cancer patients, and we further obtain the single nucleotide polymorphism (SNP) mutation data from the 1k Genomes project, following processing steps from the authors of BINAPS [9]. In the entertainment domain, we use the user-movie datasets Movielens and Netflix, binarizing the original 5-star-scale ratings by setting only reviews with more than 3.5 stars to 1. Finally, as data from the innovation domain, we derive a directed citation network between patent groups from patent citation and classification data provided by Patents View. For each dataset with a given number of groups, such as cancer types or movie genres, we set the matrix rank k to 33 (TCGA), 28 (Genomes), 136 (Patents), 20 (Movielens), and 20 (Netflix). When the number of subgroups is unknown, we estimate the rank that minimizes MDL using ELBMF, resulting in 100 (GBM), 32 (LGG), 100 (String), and 450 (Cerebellum)."
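The rating binarization described above (only reviews with more than 3.5 stars become 1) is a simple threshold; a minimal sketch, using a made-up ratings matrix in place of the actual Movielens/Netflix data:

```python
import numpy as np

# Hypothetical users-by-movies matrix on the 5-star scale; 0 marks "no rating".
# The values here are illustrative only, not taken from the actual datasets.
ratings = np.array([[5.0, 3.0, 0.0],
                    [4.0, 0.0, 2.5]])

# Binarize as described in the paper: ratings strictly above 3.5 stars map to 1.
boolean_matrix = (ratings > 3.5).astype(np.uint8)
print(boolean_matrix)
# [[1 0 0]
#  [1 0 0]]
```

Unrated entries and lukewarm reviews both map to 0, so the resulting Boolean matrix encodes only clearly positive user-movie interactions.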
Dataset Splits | No | The paper discusses synthetic and real-world datasets but does not explicitly provide the training/validation/test splits (percentages or sample counts) needed to reproduce the data partitioning.
Hardware Specification | Yes | "We implement ELBMF in the Julia language and run experiments on 16 Cores of an AMD EPYC 7702 and a single NVIDIA A100 GPU, reporting wall-clock time."
Software Dependencies | No | The paper states "We implement ELBMF in the Julia language" but does not provide version numbers for Julia or for any other software libraries used in the experiments.
Experiment Setup | Yes | "We run each method on our synthetic datasets, targeting a matrix rank of 5. To account for random fluctuations, we average over 10 randomly drawn sets of 5 ground-truth tiles per 10% increment in noise probability. ... Hence, we gradually increase λ to prevent a subpar solution and achieve a Boolean outcome, using a regularization rate λ_t = λ · γ^t for t ≥ 0 (Eq. 13) that gradually increases the proximal distance at a user-defined rate."
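The regularization schedule quoted above, λ_t = λ · γ^t for t ≥ 0, can be sketched directly; the concrete values for λ and γ below are illustrative, since the paper leaves the rate user-defined:

```python
def regularization_rate(lam, gamma, t):
    """Schedule lambda_t = lambda * gamma**t (Eq. 13 in the paper):
    with gamma > 1, the regularization strength grows each iteration t,
    gradually forcing the factor matrices toward Boolean values."""
    return lam * gamma**t

# Illustrative base rate and growth factor (assumed values, not the paper's).
lam, gamma = 0.01, 1.02
schedule = [regularization_rate(lam, gamma, t) for t in range(5)]
```

Starting with a small λ avoids locking in a subpar Boolean solution early, while the geometric growth ensures the iterates eventually become (near-)Boolean.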