Compressing Tabular Data via Latent Variable Estimation

Authors: Andrea Montanari, Eric Weiner

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. (The factorization this implies is written out after the table.)
Researcher Affiliation | Industry | Project N, Mountain View, CA, United States. Correspondence to: Andrea Montanari <am@projectn.co>.
Pseudocode | Yes | Algorithm 1 "Latent-based Tabular Compressor" (Page 3); Algorithm 2 "Spectral latents estimation" (Page 4). (A hedged sketch of a spectral estimation step appears after the table.)
Open Source Code | No | The paper states the tools it used (Zstandard bindings, scikit-learn) but does not provide a link to its own source code.
Open Datasets | Yes | Taxicab: a table with m = 62,495, n = 18 (NYC.gov, 2022); Network: four social networks from (Leskovec & Krevl, 2014)...; Card transactions... (Altman, 2019); Business price index... (stats.govt.nz, 2022); Forest: a table from the UCI data repository... (Dua & Graff, 2017); US Census: another table from (Dua & Graff, 2017)...; Jokes: a collaborative filtering dataset... (Goldberg et al., 2001; Goldberg et al.).
Dataset Splits | No | The paper describes the datasets used but does not specify any training, validation, or test splits.
Hardware Specification | Yes | Runtimes were averaged over 5 runs on a MacBook Pro, single-threaded, with a 2 GHz 4-core Intel i5 chip.
Software Dependencies | No | The paper mentions using 'Zstandard (ZSTD) Python bindings to the C implementation using the library zstd' and 'the scikit-learn implementation via sklearn.cluster.KMeans', but it does not provide specific version numbers for these libraries. (A snippet for recording locally installed versions follows the table.)
Experiment Setup | Yes | We implemented the following two options for the base compressors ZX (for data blocks) and ZL (for latents). Dictionary-based compression (Lempel-Ziv, LZ): for this we used Zstandard (ZSTD) Python bindings to the C implementation using the library zstd, with level 12. For the clustering step we use the scikit-learn implementation via sklearn.cluster.KMeans, with random initialization. We choose |Lr|, |Lc| by optimizing the compressed size. We run KMeans on the data 5 times, with random initializations, finding the DRR each time and reporting the average. (A sketch of this selection loop follows the table.)
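
The model description in the Research Type row can be made explicit. Writing L^r_i for the latent value of row i and L^c_j for that of column j (our notation, not necessarily the paper's), the stated independence assumptions correspond to the factorization

\[
P(X, L^r, L^c) \;=\; \prod_{i=1}^{m} P(L^r_i)\,\prod_{j=1}^{n} P(L^c_j)\,\prod_{i,j} P\!\left(X_{ij} \mid L^r_i, L^c_j\right).
\]

The setup's split into a data-block compressor ZX and a latent compressor ZL mirrors this factorization: latents and table entries can be encoded separately.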
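Algorithm 2 is only named above, not reproduced. As a reading aid, here is a minimal sketch of one common spectral way to estimate discrete row latents, assuming a numeric table X; the paper's actual algorithm may differ in its projection and clustering details, and the function name spectral_latents is ours.

```python
# Hedged sketch of spectral latent estimation, NOT the authors' Algorithm 2:
# project rows onto top singular directions, then cluster the projections.
import numpy as np
from sklearn.cluster import KMeans

def spectral_latents(X, k, d=None):
    """Assign each of the m rows of X one of k discrete latent values."""
    d = d if d is not None else k
    # Top-d SVD of the column-centered table.
    U, S, _ = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    embedding = U[:, :d] * S[:d]  # spectral embedding of the rows
    return KMeans(n_clusters=k, init="random", n_init=5).fit_predict(embedding)
```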
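Since the report flags missing version numbers, a reproducer can at least record the versions installed locally. This snippet is generic, assuming the zstd and scikit-learn packages are installed:

```python
# Record the library versions the paper omits (works for any installed package).
from importlib.metadata import version

for pkg in ("zstd", "scikit-learn"):
    print(pkg, version(pkg))
```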
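The Experiment Setup row pins down the two base compressors and the model-selection loop. The sketch below shows the shape of that loop under stated assumptions: X is a numeric NumPy table, blocks are serialized with .tobytes(), and only the number of row latents is searched. compressed_size and best_num_latents are our names, and the block layout is a guess, not the authors' implementation.

```python
# Hedged sketch of the evaluation loop: KMeans latents + ZSTD (level 12),
# choosing the number of latent values by minimizing the compressed size.
import numpy as np
import zstd                      # Python bindings to the C Zstandard library
from sklearn.cluster import KMeans

LEVEL = 12                       # ZSTD compression level used in the paper

def compressed_size(X, k):
    """Total bytes: compressed latent assignments + per-latent data blocks."""
    labels = KMeans(n_clusters=k, init="random", n_init=1).fit_predict(X)
    size = len(zstd.compress(labels.astype(np.uint8).tobytes(), LEVEL))  # Z_L
    for r in range(k):
        size += len(zstd.compress(X[labels == r].tobytes(), LEVEL))     # Z_X
    return size

def best_num_latents(X, candidates=(2, 4, 8, 16)):
    """Pick |Lr| by the smallest size, averaged over 5 random KMeans inits."""
    avg = {k: np.mean([compressed_size(X, k) for _ in range(5)])
           for k in candidates}
    return min(avg, key=avg.get)
```

Grouping rows by latent value before compressing mirrors the ZX/ZL split in the setup description; whether per-latent blocks are compressed separately or concatenated first is one of the details this sketch guesses at.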