Subtractive Mixture Models via Squaring: Representation and Learning
Authors: Lorenzo Loconte, Aleksanteri M. Sladek, Stefan Mengel, Martin Trapp, Arno Solin, Nicolas Gillis, Antonio Vergari
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, iv) we provide empirical evidence (Sec. 5) that NPC2s can approximate distributions better than monotonic PCs for a variety of experimental settings involving learning from real-world data and distilling intractable models such as large language models to unlock tractable inference (Zhang et al., 2023). |
| Researcher Affiliation | Collaboration | Lorenzo Loconte1 Aleksanteri M. Sladek2 Stefan Mengel3 Martin Trapp2 Arno Solin2 Nicolas Gillis4 Antonio Vergari1 1 School of Informatics, University of Edinburgh, UK 2 Department of Computer Science, Aalto University, Finland 3 University of Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), France 4 Department of Mathematics and Operational Research, Université de Mons, Belgium |
| Pseudocode | Yes | Algorithm 1 squareTensorizedCircuit(ℓ, R) |
| Open Source Code | Yes | The source code, documentation, data sets and scripts needed to reproduce the results and figures, are available at https://github.com/april-tools/squared-npcs. |
| Open Datasets | Yes | In Sec. 5 we evaluate NPC2s for density estimation on five multivariate UCI data sets (Dua & Graff, 2017): Power (Hebrail & Berard, 2012), Gas (Fonollosa et al., 2015), Hepmass (Baldi et al., 2016), MiniBooNE (Roe et al., 2004) and BSDS300 patches (Martin et al., 2001) by following the pre-processing by Papamakarios et al. (2017). |
| Dataset Splits | Yes | Given p*(x) the distribution modeled by GPT2 over sentences x = [x1, . . . , xD] having maximum length D, we aim to minimize the Kullback-Leibler divergence KL[p* ∥ p], where p is modeled by a PC. Minimizing such divergence is equivalent to learning the PC by maximum likelihood on data sampled from GPT2. Therefore, following the experimental setting by Zhang et al. (2023), we sample a data set of 8M sentences from GPT2 with bounded length D = 32, i.e., a maximum of 32 tokens. Then, we split these sentences into training, validation and test sets with proportions 0.85/0.05/0.10, respectively. |
| Hardware Specification | Yes | The benchmarks mentioned above and illustrated in Figs. C.1 to C.3 have been run on a single NVIDIA RTX A6000 with 48GiB of memory. |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | All models are learned by batched stochastic gradient descent using the Adam optimizer with default learning rate (Kingma & Ba, 2015) and a batch size of 256. The parameters of all mixtures are initialized by sampling uniformly between 0 and 1. Furthermore, monotonicity in (squared) PCs is ensured by exponentiating the parameters. |
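
The Pseudocode row above points to Algorithm 1 (squareTensorizedCircuit), which squares a tensorized circuit to obtain a tractable non-monotonic PC (NPC2). As a minimal, hedged illustration of the underlying squaring idea, not the authors' algorithm, the sketch below builds a one-dimensional mixture with real-valued (possibly negative) weights, squares it, and normalizes it in closed form using the fact that the product of two Gaussian pdfs integrates analytically; all component names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm

# A "squared" (subtractive) mixture: q(x) = (sum_i w_i f_i(x))^2 / Z is
# non-negative even though some weights w_i are negative, and the normalizer
# Z = sum_{i,j} w_i w_j * integral(f_i f_j) is tractable for Gaussian
# components because integral(N(x; mu_i, s_i^2) N(x; mu_j, s_j^2)) dx
# equals N(mu_i; mu_j, s_i^2 + s_j^2).

w = np.array([1.0, -0.6, 0.3])        # one negative weight: subtractive mixture
mu = np.array([-1.0, 0.0, 2.0])
sigma = np.array([0.5, 0.8, 1.0])

def squared_mixture_pdf(x):
    comps = norm.pdf(x[:, None], loc=mu, scale=sigma)   # (n_points, n_components)
    unnorm = (comps @ w) ** 2                            # squared linear combination
    # Closed-form pairwise Gaussian product integrals give the normalizer.
    Z = sum(
        w[i] * w[j] * norm.pdf(mu[i], loc=mu[j], scale=np.sqrt(sigma[i]**2 + sigma[j]**2))
        for i in range(len(w)) for j in range(len(w))
    )
    return unnorm / Z

xs = np.linspace(-5.0, 6.0, 2001)
print("approx. integral:", np.trapz(squared_mixture_pdf(xs), xs))  # close to 1
```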
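The Dataset Splits row describes sampling 8M bounded-length sentences from GPT2 and partitioning them 0.85/0.05/0.10 into training, validation and test sets. A minimal sketch of such a split, assuming the sampled sentences are already stored as token-ID sequences (the shuffle seed and helper name are assumptions, not from the paper):

```python
import numpy as np

def split_dataset(sentences, train=0.85, valid=0.05, test=0.10, seed=42):
    """Shuffle and split sampled sentences into train/valid/test subsets."""
    assert abs(train + valid + test - 1.0) < 1e-9
    idx = np.random.default_rng(seed).permutation(len(sentences))
    n_train = int(train * len(sentences))
    n_valid = int(valid * len(sentences))
    return (
        [sentences[i] for i in idx[:n_train]],
        [sentences[i] for i in idx[n_train:n_train + n_valid]],
        [sentences[i] for i in idx[n_train + n_valid:]],
    )
```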
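The Experiment Setup row specifies Adam with its default learning rate, a batch size of 256, mixture parameters initialized uniformly between 0 and 1, and monotonicity enforced by exponentiating parameters. A hedged PyTorch sketch of that configuration follows; the Gaussian mixture model and synthetic data are placeholders, not the released circuit code.

```python
import torch

class MonotonicGaussianMixture(torch.nn.Module):
    """Placeholder monotonic mixture: raw parameters are exponentiated so the
    effective mixture weights and scales stay positive."""

    def __init__(self, num_components: int, dim: int):
        super().__init__()
        # All mixture parameters initialized uniformly between 0 and 1.
        self.weight_logits = torch.nn.Parameter(torch.rand(num_components))
        self.means = torch.nn.Parameter(torch.rand(num_components, dim))
        self.scale_params = torch.nn.Parameter(torch.rand(num_components, dim))

    def log_prob(self, x):                                    # x: (batch, dim)
        log_w = torch.log_softmax(self.weight_logits, dim=0)  # exp + normalize
        comps = torch.distributions.Normal(self.means, self.scale_params.exp())
        comp_lp = comps.log_prob(x.unsqueeze(1)).sum(dim=-1)  # (batch, K)
        return torch.logsumexp(log_w + comp_lp, dim=-1)       # (batch,)

model = MonotonicGaussianMixture(num_components=32, dim=6)
optimizer = torch.optim.Adam(model.parameters())              # default lr = 1e-3

data = torch.rand(1024, 6)                                    # synthetic stand-in data
loader = torch.utils.data.DataLoader(data, batch_size=256, shuffle=True)
for batch in loader:
    optimizer.zero_grad()
    loss = -model.log_prob(batch).mean()                      # maximum likelihood
    loss.backward()
    optimizer.step()
```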