Subtractive Mixture Models via Squaring: Representation and Learning

Authors: Lorenzo Loconte, Aleksanteri Mikulus Sladek, Stefan Mengel, Martin Trapp, Arno Solin, Nicolas Gillis, Antonio Vergari

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, iv) we provide empirical evidence (Sec. 5) that NPC2s can approximate distributions better than monotonic PCs for a variety of experimental settings involving learning from real-world data and distilling intractable models such as large language models to unlock tractable inference (Zhang et al., 2023).
Researcher Affiliation | Collaboration | Lorenzo Loconte (1), Aleksanteri M. Sladek (2), Stefan Mengel (3), Martin Trapp (2), Arno Solin (2), Nicolas Gillis (4), Antonio Vergari (1). Affiliations: (1) School of Informatics, University of Edinburgh, UK; (2) Department of Computer Science, Aalto University, Finland; (3) University of Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), France; (4) Department of Mathematics and Operational Research, Université de Mons, Belgium.
Pseudocode | Yes | Algorithm 1: squareTensorizedCircuit(ℓ, R). A hedged sketch of the squaring construction on a toy mixture is given after this table.
Open Source Code | Yes | The source code, documentation, data sets and scripts needed to reproduce the results and figures are available at https://github.com/april-tools/squared-npcs.
Open Datasets | Yes | In Sec. 5 we evaluate NPC2s for density estimation on five multivariate UCI data sets (Dua & Graff, 2017): Power (Hebrail & Berard, 2012), Gas (Fonollosa et al., 2015), Hepmass (Baldi et al., 2016), MiniBooNE (Roe et al., 2004) and BSDS300 patches (Martin et al., 2001), following the pre-processing by Papamakarios et al. (2017).
Dataset Splits | Yes | Given p*(x), the distribution modeled by GPT2 over sentences x = [x_1, ..., x_D] of maximum length D, we aim to minimize the Kullback-Leibler divergence KL[p* || p], where p is modeled by a PC. Minimizing this divergence is equivalent to learning the PC by maximum likelihood on data sampled from GPT2. Therefore, following the experimental setting of Zhang et al. (2023), we sample a data set of 8M sentences from GPT2 with bounded length D = 32, i.e., with a maximum of 32 tokens. We then split these sentences into training, validation and test sets with proportions 0.85/0.05/0.10, respectively. A sampling-and-split sketch is given after this table.
Hardware Specification | Yes | The benchmarks mentioned above and illustrated in Figs. C.1 to C.3 have been run on a single NVIDIA RTX A6000 with 48 GiB of memory.
Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | All models are learned by batched stochastic gradient descent using the Adam optimizer with default learning rate (Kingma & Ba, 2015) and a batch size of 256. The parameters of all mixtures are initialized by sampling uniformly between 0 and 1. Furthermore, monotonicity in (squared) PCs is ensured by exponentiating the parameters. A training-loop sketch reflecting this setup is given after this table.
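
To make the Algorithm 1 row concrete, here is a minimal, hedged sketch of the squaring idea on the simplest possible case: a one-dimensional mixture of Gaussians with real-valued (possibly negative) weights, renormalized in closed form using the pairwise Gaussian product integrals ∫ N(x; μ_i, σ_i) N(x; μ_j, σ_j) dx = N(μ_i − μ_j; 0, sqrt(σ_i² + σ_j²)). The function names and parameter values are illustrative and do not come from the paper or the squared-npcs repository, which works with tensorized circuits rather than this toy mixture.

```python
# Hedged sketch: a 1-D subtractive (squared non-monotonic) mixture of Gaussians.
# Names such as squared_mixture_pdf and normalizer are illustrative only.
import numpy as np
from scipy.stats import norm

# Real-valued mixture weights: the negative component makes this subtractive.
w = np.array([1.0, -0.6])
mu = np.array([0.0, 0.0])
sigma = np.array([1.0, 0.5])

def normalizer(w, mu, sigma):
    """Closed-form Z = sum_ij w_i w_j * int N(x; mu_i, s_i) N(x; mu_j, s_j) dx."""
    scale = np.sqrt(sigma[:, None] ** 2 + sigma[None, :] ** 2)
    cross = norm.pdf(mu[:, None] - mu[None, :], loc=0.0, scale=scale)
    return float(np.sum(w[:, None] * w[None, :] * cross))

def squared_mixture_pdf(x, w, mu, sigma):
    """p(x) = (sum_i w_i N(x; mu_i, sigma_i))^2 / Z, non-negative by construction."""
    c = np.sum(w * norm.pdf(x[:, None], loc=mu, scale=sigma), axis=1)
    return c ** 2 / normalizer(w, mu, sigma)

xs = np.linspace(-4.0, 4.0, 2001)
print(np.trapz(squared_mixture_pdf(xs, w, mu, sigma), xs))  # ~1.0: Z is exact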
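
For the dataset-splits row, the following is a rough sketch of how one could draw bounded-length samples from GPT2 with Hugging Face transformers and split them 0.85/0.05/0.10. The checkpoint name, generation flags and helper functions are assumptions for illustration; the authors' actual distillation pipeline (following Zhang et al., 2023) is not reproduced here.

```python
# Hedged sketch: sample bounded-length sentences from GPT2 and split 0.85/0.05/0.10.
# Generation settings and helper names below are assumptions, not the authors' script.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sample_sentences(n, max_len=32, batch_size=64):
    samples = []
    bos = torch.full((batch_size, 1), tokenizer.bos_token_id)
    with torch.no_grad():
        while len(samples) < n:
            out = model.generate(bos, do_sample=True, top_k=0, temperature=1.0,
                                 max_length=max_len,
                                 pad_token_id=tokenizer.eos_token_id)
            samples.extend(out.tolist())
    return samples[:n]

def split(data, fractions=(0.85, 0.05, 0.10)):
    n_train = int(fractions[0] * len(data))
    n_valid = int(fractions[1] * len(data))
    return data[:n_train], data[n_train:n_train + n_valid], data[n_train + n_valid:]

# The paper samples 8M sentences; a tiny number is used here for illustration.
train, valid, test = split(sample_sentences(1000))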
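
Finally, a hedged sketch of the reported optimization setup: Adam with its default learning rate, batch size 256, mixture parameters initialized uniformly in [0, 1], and non-negativity (monotonicity) enforced by exponentiating unconstrained parameters. The toy Gaussian mixture below is a stand-in, not the paper's circuit architecture; only the optimizer, batch size, initialization and exponentiation come from the excerpt.

```python
# Hedged sketch of the reported training setup; GaussianMixture1D is a toy stand-in.
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

class GaussianMixture1D(torch.nn.Module):
    def __init__(self, k):
        super().__init__()
        # Mixture parameters initialized uniformly in [0, 1], as in the excerpt.
        self.log_w = torch.nn.Parameter(torch.rand(k))
        self.means = torch.nn.Parameter(torch.rand(k))

    def log_prob(self, x):
        # Monotonicity: non-negative mixture weights via exponentiated parameters.
        w = torch.exp(self.log_w)
        w = w / w.sum()
        log_comp = -0.5 * (x[:, None] - self.means) ** 2 - 0.5 * math.log(2 * math.pi)
        return torch.logsumexp(torch.log(w) + log_comp, dim=1)

data = TensorDataset(torch.randn(10_000))                # placeholder 1-D data set
loader = DataLoader(data, batch_size=256, shuffle=True)  # batch size 256
model = GaussianMixture1D(k=16)
optimizer = torch.optim.Adam(model.parameters())         # Adam with default lr = 1e-3

for epoch in range(5):
    for (x,) in loader:
        optimizer.zero_grad()
        loss = -model.log_prob(x).mean()                 # maximum likelihood
        loss.backward()
        optimizer.step()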