Probabilistic Boolean Tensor Decomposition

Authors: Tammo Rukat, Chris Holmes, Christopher Yau

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate three real-world data-sets. First, temporal interaction networks in a hospital ward and behavioural data of university students demonstrate the inference of instructive latent patterns. Next, we decompose a tensor with more than 10 billion data points, indicating relations of gene expression in cancer patients. Not only does this demonstrate scalability, it also provides an entirely novel perspective on relational properties of continuous data and, in the present example, on the molecular heterogeneity of cancer.
Researcher Affiliation | Academia | Department of Statistics, University of Oxford, UK; The Alan Turing Institute, London, UK; Centre for Computational Biology, Institute of Cancer and Genomic Sciences, University of Birmingham, UK. Correspondence to: Tammo Rukat <tammo.rukat@stats.ox.ac.uk>.
Pseudocode | Yes | Pseudocode for the computational procedure is given in Algorithm 1. (A hedged sketch of the corresponding Gibbs update appears below the table.)
Open Source Code | Yes | Our implementation is available on GitHub: https://github.com/TammoR/LogicalFactorisationMachines
Open Datasets | Yes | Records of contact between pairs of individuals in a university hospital have originally been acquired to investigate transmission routes of infectious diseases (Vanhems et al., 2013). Our second example is part of the student-life dataset introduced by Harari et al. (2017) and given by records of the seating positions of students throughout a 9-week Android programming course. The publicly available TCGA dataset (Weinstein et al., 2013) contains gene expression measurements of a large variety of cancer patients across different types of cancer.
Dataset Splits | Yes | In the second approach, cross validation, we treat 20% of the data as unobserved during training and choose the model dimensionality that achieves the highest posterior predictive accuracy on the held-out data. (A sketch of this protocol appears below the table.)
Hardware Specification | No | With these moderately sized datasets of less than 100,000 data-points, sampling until convergence and drawing 50 samples takes only a few seconds on a single core. Eventually, we turn to a large-scale biological example, analysing networks of relative gene expression in cancer patients with more than 10 billion data points. Here, the inference procedure takes around 10 hours.
Software Dependencies | No | The paper does not specify software dependencies with version numbers.
Experiment Setup | Yes | Posterior samples of the factors are drawn, following the procedure described in Section 3.3 with λ initialised to 0.5 and the initial factors drawn i.i.d. Bern(0.5). (An end-to-end run under these settings is sketched below the table.)
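
The Pseudocode row points to Algorithm 1, whose core step is a Gibbs sweep over the binary factor entries: under the OrMachine-style noise model this paper builds on, every correctly reconstructed cell contributes +λ to the log-likelihood and every mismatch contributes -λ, so the full conditional of a single factor entry is logistic in λ times the change in the number of correctly reconstructed cells. The following is a minimal sketch of that update for the two-way (matrix) special case, written for this report rather than taken from the paper's repository; the function name, the uniform Bern(0.5) prior on factor entries, and the observation-mask convention are our assumptions, and the paper's Algorithm 1 additionally covers arbitrary-order tensors and the inference of λ.

```python
import numpy as np

def gibbs_sweep_factor(X, U, V, lam, rng, obs=None):
    """One Gibbs sweep over all entries of the factor matrix U in the
    Boolean (OR-AND) factorisation x_nm = OR_l (u_nl AND v_ml).

    Each observed cell contributes +lam to the log-likelihood when the
    Boolean reconstruction matches the data and -lam otherwise, so the
    full conditional of u_nl is logistic in lam times the change in the
    number of correctly reconstructed cells.

    X: (N, M) array in {0, 1};  U: (N, L);  V: (M, L);
    obs: (N, M) boolean mask of observed cells (None = fully observed).
    """
    N, L = U.shape
    if obs is None:
        obs = np.ones_like(X, dtype=bool)
    for n in range(N):
        for l in range(L):
            # Reconstruction of row n using every latent dimension except l.
            rest = ((U[n] * V)[:, np.arange(L) != l]).any(axis=1)
            pred1 = rest | V[:, l].astype(bool)  # reconstruction if u_nl = 1
            pred0 = rest                         # reconstruction if u_nl = 0
            # Change in the number of correctly reconstructed observed cells.
            delta = (((pred1 == X[n]) & obs[n]).sum()
                     - ((pred0 == X[n]) & obs[n]).sum())
            # Logistic full conditional under a uniform Bern(0.5) prior.
            p_one = 1.0 / (1.0 + np.exp(-2.0 * lam * delta))
            U[n, l] = rng.random() < p_one
```

The three-way tensor case is structurally the same: each of the three factor matrices is swept in turn, with the role of V played by the element-wise AND of the other two factors.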
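The Experiment Setup row fixes the initialisation (λ = 0.5, factors i.i.d. Bern(0.5)), and the Hardware row mentions sampling until convergence and then drawing 50 samples. A hedged end-to-end run under those settings could look as follows; the toy data, the burn-in length, and the λ schedule are our assumptions (the paper infers λ rather than annealing it), and gibbs_sweep_factor is the function from the sketch above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with planted Boolean structure (illustrative only; not one
# of the paper's datasets).
N, M, L = 60, 50, 3
U_true = (rng.random((N, L)) < 0.3).astype(int)
V_true = (rng.random((M, L)) < 0.3).astype(int)
X = (U_true @ V_true.T > 0).astype(int)        # Boolean OR-AND product

# Initialisation as quoted: lambda = 0.5, factors i.i.d. Bern(0.5).
lam = 0.5
U = (rng.random((N, L)) < 0.5).astype(int)
V = (rng.random((M, L)) < 0.5).astype(int)

samples = []
for sweep in range(150):
    gibbs_sweep_factor(X, U, V, lam, rng)      # update U given V
    gibbs_sweep_factor(X.T, V, U, lam, rng)    # update V given U (roles swapped)
    lam = min(lam + 0.05, 2.0)                 # stand-in schedule; the paper infers lambda
    if sweep >= 100:                           # crude burn-in, then keep 50 draws
        samples.append((U.copy(), V.copy()))

# Posterior-mean reconstruction from the 50 retained samples.
post = np.mean([u @ v.T > 0 for u, v in samples], axis=0)
print("training accuracy:", ((post > 0.5) == X).mean())
```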
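Finally, the Dataset Splits row describes the cross-validation protocol: 20% of cells are masked during training, and the model dimensionality with the highest posterior predictive accuracy on those cells is kept. A sketch under the same assumptions as above; the helper name, the schedule, and the 0.5 decision threshold are ours.

```python
import numpy as np

def fit_and_score(X, L, rng, lam0=0.5, n_burn=100, n_draws=50):
    """Hold out 20% of cells, fit a rank-L Boolean factorisation on the
    rest, and return posterior predictive accuracy on the held-out cells."""
    held_out = rng.random(X.shape) < 0.2       # 20% treated as unobserved
    obs = ~held_out
    lam = lam0
    U = (rng.random((X.shape[0], L)) < 0.5).astype(int)
    V = (rng.random((X.shape[1], L)) < 0.5).astype(int)
    draws = []
    for sweep in range(n_burn + n_draws):
        gibbs_sweep_factor(X, U, V, lam, rng, obs=obs)
        gibbs_sweep_factor(X.T, V, U, lam, rng, obs=obs.T)
        lam = min(lam + 0.05, 2.0)
        if sweep >= n_burn:
            draws.append((U.copy(), V.copy()))
    post = np.mean([u @ v.T > 0 for u, v in draws], axis=0)
    return ((post > 0.5) == X)[held_out].mean()

# Model selection on the toy X from the run above: keep the
# dimensionality with the best held-out score.
rng = np.random.default_rng(1)
scores = {L: fit_and_score(X, L, rng) for L in (2, 3, 4, 5)}
print("selected dimensionality:", max(scores, key=scores.get))
```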