Compressing Tabular Data via Latent Variable Estimation

Authors: Andrea Montanari, Eric Weiner

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. (The factorization this implies is written out after the table.)
Researcher Affiliation | Industry | Project N, Mountain View, CA, United States. Correspondence to: Andrea Montanari <am@projectn.co>.
Pseudocode | Yes | Algorithm 1 "Latent-based Tabular Compressor" (Page 3); Algorithm 2 "Spectral latents estimation" (Page 4). (A hedged sketch of a spectral estimation step appears after the table.)
Open Source Code | No | The paper states the tools it used (Zstandard bindings, scikit-learn) but does not provide a link to its own source code.
Open Datasets | Yes | Taxicab: a table with m = 62,495, n = 18 (NYC.gov, 2022); Network: four social networks from (Leskovec & Krevl, 2014)...; Card transactions... (Altman, 2019); Business price index... (stats.govt.nz, 2022); Forest: a table from the UCI data repository... (Dua & Graff, 2017); US Census: another table from (Dua & Graff, 2017)...; Jokes: a collaborative filtering dataset... (Goldberg et al., 2001; Goldberg et al.).
Dataset Splits | No | The paper describes the datasets used but does not specify any training, validation, or test splits.
Hardware Specification | Yes | Runtimes were averaged over 5 runs on a MacBook Pro, single-threaded, with a 2 GHz 4-core Intel i5 chip.
Software Dependencies | No | The paper mentions using 'Zstandard (ZSTD) Python bindings to the C implementation using the library zstd' and 'the scikit-learn implementation via sklearn.cluster.KMeans', but it does not provide specific version numbers for these libraries. (A snippet for recording locally installed versions follows the table.)
Experiment Setup | Yes | We implemented the following two options for the base compressors ZX (for data blocks) and ZL (for latents). Dictionary-based compression (Lempel-Ziv, LZ): for this we used Zstandard (ZSTD) Python bindings to the C implementation using the library zstd, with level 12. For the clustering step we use the scikit-learn implementation via sklearn.cluster.KMeans, with random initialization. We choose |Lr|, |Lc| by optimizing the compressed size. We run KMeans on the data 5 times, with random initializations, finding the DRR each time and reporting the average. (A sketch of this selection loop follows the table.)
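
The model description in the Research Type row can be made explicit. Writing L^r_i for the latent value of row i and L^c_j for that of column j (our notation, not necessarily the paper's), the stated independence assumptions correspond to the factorization

\[
P(X, L^r, L^c) \;=\; \prod_{i=1}^{m} P(L^r_i)\,\prod_{j=1}^{n} P(L^c_j)\,\prod_{i,j} P\!\left(X_{ij} \mid L^r_i, L^c_j\right).
\]

The setup's split into a data-block compressor ZX and a latent compressor ZL mirrors this factorization: latents and table entries can be encoded separately.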
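Algorithm 2 is only named above, not reproduced. As a reading aid, here is a minimal sketch of one common spectral way to estimate discrete row latents, assuming a numeric table X; the paper's actual algorithm may differ in its projection and clustering details, and the function name spectral_latents is ours.

```python
# Hedged sketch of spectral latent estimation, NOT the authors' Algorithm 2:
# project rows onto top singular directions, then cluster the projections.
import numpy as np
from sklearn.cluster import KMeans

def spectral_latents(X, k, d=None):
    """Assign each of the m rows of X one of k discrete latent values."""
    d = d if d is not None else k
    # Top-d SVD of the column-centered table.
    U, S, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    embedding = U[:, :d] * S[:d]  # spectral embedding of the rows
    return KMeans(n_clusters=k, init="random", n_init=5).fit_predict(embedding)
```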
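Since the report flags missing version numbers, a reproducer can at least record the versions installed locally. This snippet is generic, assuming the zstd and scikit-learn packages are installed:

```python
# Record the library versions the paper omits (works for any installed package).
from importlib.metadata import version

for pkg in ("zstd", "scikit-learn"):
    print(pkg, version(pkg))
```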
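The Experiment Setup row pins down the two base compressors and the model-selection loop. The sketch below shows the shape of that loop under stated assumptions: X is a numeric NumPy table, blocks are serialized with .tobytes(), and only the number of row latents is searched. compressed_size and best_num_latents are our names, and the block layout is a guess, not the authors' implementation.

```python
# Hedged sketch of the evaluation loop: KMeans latents + ZSTD (level 12),
# choosing the number of latent values by minimizing the compressed size.
import numpy as np
import zstd                      # Python bindings to the C Zstandard library
from sklearn.cluster import KMeans

LEVEL = 12                       # ZSTD compression level used in the paper

def compressed_size(X, k):
    """Total bytes: compressed latent assignments + per-latent data blocks."""
    labels = KMeans(n_clusters=k, init="random", n_init=1).fit_predict(X)
    size = len(zstd.compress(labels.astype(np.uint8).tobytes(), LEVEL))  # Z_L
    for r in range(k):
        size += len(zstd.compress(X[labels == r].tobytes(), LEVEL))     # Z_X
    return size

def best_num_latents(X, candidates=(2, 4, 8, 16)):
    """Pick |Lr| by the smallest size, averaged over 5 random KMeans inits."""
    avg = {k: np.mean([compressed_size(X, k) for _ in range(5)])
           for k in candidates}
    return min(avg, key=avg.get)
```

Grouping rows by latent value before compressing mirrors the ZX/ZL split in the setup description; whether per-latent blocks are compressed separately or concatenated first is one of the details this sketch guesses at.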