Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Compressing Tabular Data via Latent Variable Estimation
Authors: Andrea Montanari, Eric Weiner
ICML 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. |
| Researcher Affiliation | Industry | Project N, Mountain View, CA, United States. Correspondence to: Andrea Montanari <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Latent-based Tabular Compressor (Page 3); Algorithm 2 Spectral latents estimation (Page 4) |
| Open Source Code | No | The paper states it used |
| Open Datasets | Yes | Taxicab. A table with m = 62,495, n = 18 (NYC.gov, 2022).; Network. Four social networks from (Leskovec & Krevl, 2014)...; Card transactions... (Altman, 2019).; Business price index... (stats.govt.nz, 2022).; Forest. A table from the UCI data repository... (Dua & Graff, 2017).; US Census. Another table from (Dua & Graff, 2017)...; Jokes. A collaborative filtering dataset... (Goldberg et al., 2001; Goldberg et al.). |
| Dataset Splits | No | The paper describes the datasets used but does not specify any training, validation, or test splits. |
| Hardware Specification | Yes | Runtimes were averaged over 5 runs on a Macbook Pro single-threaded with a 2 GHz 4-core Intel i5 chip. |
| Software Dependencies | No | The paper mentions using 'Zstandard (ZSTD) Python bindings to the C implementation using the library zstd' and 'the scikit-learn implementation via sklearn.cluster.KMeans', but it does not provide specific version numbers for these libraries. |
| Experiment Setup | Yes | We implemented the following two options for the base compressors Z_X (for data blocks) and Z_L (for latents). Dictionary-based compression (Lempel-Ziv, LZ). For this we used Zstandard (ZSTD) Python bindings to the C implementation using the library zstd, with level 12.; For the clustering step we use the scikit-learn implementation via sklearn.cluster.KMeans, with random initialization.; We choose \|L_r\|, \|L_c\| by optimizing the compressed size. We run KMeans on the data 5 times, with random initializations finding the DRR each time and reporting the average. |
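
The experiment setup quoted in the table (cluster the rows, compress each block and the latents with a base compressor, repeat over 5 random initializations and average) can be illustrated with a rough sketch. This is not the authors' code, and it rests on several assumptions: the standard-library `zlib` stands in for ZSTD at level 12, a hand-written Lloyd iteration stands in for `sklearn.cluster.KMeans`, only row latents are used, the toy table is synthetic, and the reduction rate is taken to be one minus the ratio of latent-based size to baseline size, which may differ from the paper's definition of DRR. All function names are hypothetical.

```python
import random
import zlib


def compressed_size(rows):
    """Bytes used by zlib (assumed stand-in for ZSTD level 12) on a list of rows."""
    payload = "\n".join(",".join(map(str, r)) for r in rows).encode()
    return len(zlib.compress(payload, 9))


def kmeans_labels(rows, k, rng, iters=10):
    """Plain Lloyd iterations with random initialization (assumed stand-in for KMeans)."""
    centers = rng.sample(rows, k)
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each row to its nearest center in squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])))
            for r in rows
        ]
        # Move each center to the mean of its members; keep empty clusters unchanged.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def latent_compressed_size(rows, labels, k):
    """Size of the compressed latents plus each same-latent block compressed separately."""
    total = len(zlib.compress(bytes(labels), 9))  # requires k <= 256
    for c in range(k):
        block = [r for r, lab in zip(rows, labels) if lab == c]
        if block:
            total += compressed_size(block)
    return total


def average_reduction(rows, k, runs=5, seed=0):
    """Average the assumed reduction rate over several random initializations."""
    base = compressed_size(rows)
    rates = []
    for run in range(runs):
        rng = random.Random(seed + run)
        labels = kmeans_labels(rows, k, rng)
        rates.append(1 - latent_compressed_size(rows, labels, k) / base)
    return sum(rates) / len(rates)


# Synthetic toy table with two row types (purely illustrative data).
data_rng = random.Random(1)
table = [[data_rng.randint(0, 3) + 10 * (i % 2) for _ in range(6)] for i in range(200)]
rate = average_reduction(table, k=2)
print(f"average reduction over 5 runs: {rate:.3f}")
```

On a table this small the per-block overhead of the compressor can outweigh any gain, so the printed value may be negative; the sketch is meant to show the shape of the procedure, not to reproduce any reported number.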