Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Identifiable Deep Generative Models via Sparse Decoding
Authors: Gemma Elyse Moran, Dhanya Sridhar, Yixin Wang, David Blei
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically study the sparse VAE with both simulated and real data. We find that it recovers meaningful latent factors and has smaller heldout reconstruction error than related methods. |
| Researcher Affiliation | Academia | Gemma E. Moran (Columbia University); Dhanya Sridhar (Mila Quebec AI Institute and Université de Montréal); Yixin Wang (University of Michigan); David M. Blei (Columbia University) |
| Pseudocode | Yes | Algorithm 1: The sparse VAE |
| Open Source Code | Yes | The sparse VAE implementation may be found at https://github.com/gemoran/sparse-vae-code. |
| Open Datasets | Yes | PeerRead (Kang et al., 2018). Dataset of word counts for paper abstracts (N ≈ 10,000, G = 500). MovieLens (Harper and Konstan, 2015). Dataset of binary user-movie ratings (N = 100,000, G = 300). Zeisel (Zeisel et al., 2015). Dataset of RNA molecule counts in mouse cortex cells (N = 3005, G = 558). |
| Dataset Splits | Yes | All results are averaged over five splits of the data, with standard deviation in parentheses. We assess this question using the semi-synthetic PeerRead dataset, where the train and test data were generated by factors with different correlations. |
| Hardware Specification | Yes | GPU: NVIDIA TITAN Xp graphics card (24GB). CPU: Intel E4-2620 v4 processor (64GB). |
| Software Dependencies | No | For stochastic optimization, we use automatic differentiation in PyTorch, with optimization using Adam (Kingma and Ba, 2015) with default settings (beta1=0.9, beta2=0.999). For LDA, we used Python's sklearn package with default settings. |
| Experiment Setup | Yes | Table 6: Settings for each experiment. Synthetic data ... # hidden layers: 3; layer dimension: 50; latent space dimension: 5; learning rate: 0.01; epochs: 200; batch size: 100; loss function: Gaussian; Sparse VAE: λ1 = 1, λ0 = 10; β-VAE: [2, 4, 6, 8, 16]; VSC: α = 0.01; OI-VAE: λ = 1, p = 5; runtime per split: CPU, 2 mins |
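
The synthetic-data settings and the "averaged over five splits, with standard deviation in parentheses" convention quoted in the table could be captured roughly as follows. This is a minimal sketch and not the authors' code: the class name `SyntheticSettings`, its field names, and the helper `summarize` are hypothetical, and using the sample standard deviation is an assumption, since the quoted text does not say which estimator was used. The scores in the example are placeholders, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class SyntheticSettings:
    """Synthetic-data settings as quoted from Table 6 (field names are hypothetical)."""
    hidden_layers: int = 3
    layer_dim: int = 50
    latent_dim: int = 5
    learning_rate: float = 0.01
    epochs: int = 200
    batch_size: int = 100
    loss: str = "gaussian"
    lambda1: float = 1.0              # Sparse VAE setting
    lambda0: float = 10.0             # Sparse VAE setting
    adam_betas: tuple = (0.9, 0.999)  # Adam defaults quoted under Software Dependencies
    n_splits: int = 5                 # results are averaged over five splits


def summarize(per_split_scores, digits=2):
    """Format per-split scores as 'mean (sd)'.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    if len(per_split_scores) < 2:
        raise ValueError("need at least two splits to compute a standard deviation")
    avg = mean(per_split_scores)
    sd = stdev(per_split_scores)
    return f"{avg:.{digits}f} ({sd:.{digits}f})"


settings = SyntheticSettings()

# Placeholder numbers only, NOT results from the paper.
dummy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
assert len(dummy_scores) == settings.n_splits
print(summarize(dummy_scores))  # prints "3.00 (1.58)"
```

Keeping the settings in a frozen dataclass makes accidental changes between splits raise an error, which may help when re-running the same configuration across all five splits.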