Scaling Up Probabilistic Circuits by Latent Variable Distillation

Authors: Anji Liu, Honghua Zhang, Guy Van den Broeck

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on both image and language modeling benchmarks (e.g., ImageNet and WikiText-2) show that latent variable distillation substantially boosts the performance of large PCs compared to their counterparts without latent variable distillation. In particular, on the image modeling benchmarks, PCs achieve competitive performance against some of the widely-used deep generative models, including variational autoencoders and flow-based models, opening up new avenues for tractable generative modeling.
Researcher Affiliation | Academia | Anji Liu, Honghua Zhang & Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles. {liuanji,hzhang19,guyvdb}@cs.ucla.edu
Pseudocode | Yes | Algorithm 1: Materializing a LV (latent variable) in a PC (see the materialization sketch after the table).
Open Source Code | Yes | Our code can be found at https://github.com/UCLA-StarAI/LVD.
Open Datasets | Yes | Experiments on both image and language modeling benchmarks (e.g., ImageNet and WikiText-2) show that latent variable distillation substantially boosts the performance of large PCs compared to their counterparts without latent variable distillation. In particular, on the image modeling benchmarks, PCs achieve competitive performance against some of the widely-used deep generative models, including variational autoencoders and flow-based models, opening up new avenues for tractable generative modeling. (ImageNet is cited as Deng et al., 2009; WikiText-2 as Merity et al., 2016; CIFAR as Krizhevsky et al., 2009.)
Dataset Splits | Yes | To facilitate training and evaluation, we pre-process the tokens from WikiText-2 by concatenating them into one giant token sequence and collecting all subsequences of length 32 to construct the train, validation and test sets, respectively (see the preprocessing sketch after the table).
Hardware Specification | Yes | All experiments are run on servers/workstations with the following configurations: 32 CPUs, 128G Mem, 4 NVIDIA A5000 GPUs; 32 CPUs, 64G Mem, 1 NVIDIA GeForce RTX 3090 GPU; 64 CPUs, 128G Mem, 3 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions software such as "Juice.jl" and "PyTorch" but does not specify their version numbers, which are required for reproducible software dependencies. For example: "We implement both HCLT and RAT-SPN using the Julia package Juice.jl (Dang et al., 2021)" and "We use the original PyTorch implementation of EiNet and similarly tune their hyperparameters."
Experiment Setup | Yes | All HMM models are trained with mini-batch EM (Appx. B.1) in two phases: in phase 1, the model is trained with learning rate 0.1 for 20 epochs; in phase 2, the model is trained with learning rate 0.01 for 5 epochs. Note that for HMM models with 750 hidden states, we train for 30 epochs in phase 1. The number of epochs is selected such that all models converge before training stops. ... When optimizing the model with the MLE lower bound, we adopt mini-batch EM (Appx. B.1) with the learning rate annealed linearly from 0.1 to 0.01. In the latent distribution training step (Sec. 4), we anneal the learning rate from 0.1 to 0.001. (See the mini-batch EM sketch after the table.)
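The pseudocode entry refers to Algorithm 1, which materializes a latent variable in a PC. For context, below is a minimal sketch of the standard construction that makes the latent variable of a sum node explicit: every child branch i is multiplied by an indicator leaf [Z = i], so conditioning on Z selects a branch. The toy Sum/Product/Indicator classes and the materialize_latent function are illustrative assumptions, not the paper's Algorithm 1 or the released implementation.

```python
# Illustrative sketch only: expose a sum node's latent variable by pairing each
# child branch with an indicator leaf [Z = i]. Node classes are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    pass


@dataclass
class Indicator(Node):      # leaf: evaluates to 1 iff `variable` takes `value`
    variable: str
    value: int


@dataclass
class Product(Node):
    children: List[Node] = field(default_factory=list)


@dataclass
class Sum(Node):
    children: List[Node] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)


def materialize_latent(sum_node: Sum, lv_name: str) -> Sum:
    """Return an equivalent sum node whose mixture-component choice is exposed
    as an explicit (materialized) latent variable `lv_name`."""
    new_children = [
        Product(children=[Indicator(lv_name, i), child])
        for i, child in enumerate(sum_node.children)
    ]
    return Sum(children=new_children, weights=list(sum_node.weights))
```

Once LVs are made explicit in this way, LVD's high-level recipe is to supervise them with assignments distilled from an existing deep generative model before standard EM fine-tuning.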
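The dataset-splits entry describes the WikiText-2 preprocessing in one sentence; here is a minimal sketch of that recipe, assuming a stride of 1 over the concatenated token stream (the function name and NumPy usage are mine, not the paper's):

```python
import numpy as np


def all_length_32_subsequences(tokens, seq_len=32):
    """Concatenate the tokens of one split into a single long sequence and
    collect every contiguous subsequence of length `seq_len`."""
    tokens = np.asarray(tokens)
    # One window per starting position (stride 1);
    # result shape: (len(tokens) - seq_len + 1, seq_len)
    return np.lib.stride_tricks.sliding_window_view(tokens, seq_len)


# Applied separately to the train / validation / test token streams.
```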
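The experiment-setup entry uses mini-batch EM with a learning rate that is either fixed per phase (HMMs) or annealed linearly from 0.1 to 0.01. Below is a minimal sketch of a linear annealing schedule together with the usual convex-combination form of a mini-batch EM step; em_target_from_batch is a hypothetical placeholder for the EM target computed from one mini-batch (the paper's Appx. B.1 gives the actual procedure).

```python
import numpy as np


def linear_anneal(step, total_steps, lr_start=0.1, lr_end=0.01):
    """Linearly anneal the EM step size from lr_start to lr_end over training."""
    frac = step / max(total_steps - 1, 1)
    return lr_start + frac * (lr_end - lr_start)


def minibatch_em_update(theta, theta_em, lr):
    """One mini-batch EM step: interpolate between the current parameters
    `theta` and the mini-batch EM target `theta_em` with step size `lr`."""
    return (1.0 - lr) * np.asarray(theta) + lr * np.asarray(theta_em)


# Hypothetical training loop:
# for step, batch in enumerate(batches):
#     lr = linear_anneal(step, num_steps)
#     theta = minibatch_em_update(theta, em_target_from_batch(theta, batch), lr)
```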