Compressing Tabular Data via Latent Variable Estimation
Authors: Andrea Montanari, Eric Weiner
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this approach on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. |
| Researcher Affiliation | Industry | Project N, Mountain View, CA, United States. Correspondence to: Andrea Montanari <am@projectn.co>. |
| Pseudocode | Yes | Algorithm 1 Latent-based Tabular Compressor (Page 3); Algorithm 2 Spectral latents estimation (Page 4) |
| Open Source Code | No | The paper states which existing libraries it used (e.g., zstd and scikit-learn) but does not provide a link to an open-source release of its own code. |
| Open Datasets | Yes | Taxicab. A table with m = 62,495, n = 18 (NYC.gov, 2022).; Network. Four social networks from (Leskovec & Krevl, 2014)...; Card transactions... (Altman, 2019).; Business price index... (stats.govt.nz, 2022).; Forest. A table from the UCI data repository... (Dua & Graff, 2017).; US Census. Another table from (Dua & Graff, 2017)...; Jokes. A collaborative filtering dataset... (Goldberg et al., 2001). |
| Dataset Splits | No | The paper describes the datasets used but does not specify any training, validation, or test splits. |
| Hardware Specification | Yes | Runtimes were averaged over 5 runs, single-threaded, on a MacBook Pro with a 2 GHz 4-core Intel i5 chip. |
| Software Dependencies | No | The paper mentions using 'Zstandard (ZSTD) Python bindings to the C implementation using the library zstd' and 'the scikit-learn implementation via sklearn.cluster.KMeans', but it does not provide specific version numbers for these libraries. |
| Experiment Setup | Yes | We implemented the following two options for the base compressors ZX (for data blocks) and ZL (for latents). Dictionary-based compression (Lempel-Ziv, LZ): for this we used the Zstandard (ZSTD) Python bindings to the C implementation (library zstd), with level 12. For the clustering step we use the scikit-learn implementation via sklearn.cluster.KMeans, with random initialization. We choose \|Lr\|, \|Lc\| by optimizing the compressed size. We run KMeans on the data 5 times with random initializations, finding the DRR each time and reporting the average. (A sketch of this setup appears below the table.) |
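The following is a minimal sketch of the setup quoted above: ZSTD at level 12 as the base compressor and `sklearn.cluster.KMeans` with random initialization for latent estimation. It is not the authors' code: it uses the `zstandard` package (the paper names the `zstd` bindings), and the helper names (`compressed_size`, `latent_compressed_size`) and the block layout are assumptions about the paper's scheme, not taken from it.

```python
# Hypothetical sketch of the setup quoted above; not the authors' code.
import numpy as np
import zstandard  # pip install zstandard; the paper uses the `zstd` bindings
from sklearn.cluster import KMeans


def compressed_size(data: bytes, level: int = 12) -> int:
    """Bytes after ZSTD compression at the paper's stated level 12."""
    return len(zstandard.ZstdCompressor(level=level).compress(data))


def latent_compressed_size(table: np.ndarray, n_row: int, n_col: int,
                           seed: int = 0) -> int:
    """Compress the table together with KMeans-estimated row/column latents.

    Rows and columns are clustered separately (random initialization, as in
    the paper); the table is then reordered so that entries sharing latent
    values are contiguous before the data blocks (ZX) and the latents (ZL)
    are compressed. This particular layout is an assumption.
    """
    rows = KMeans(n_clusters=n_row, init="random", n_init=1,
                  random_state=seed).fit_predict(table)
    cols = KMeans(n_clusters=n_col, init="random", n_init=1,
                  random_state=seed).fit_predict(table.T)
    blocks = table[np.argsort(rows)][:, np.argsort(cols)]
    return (compressed_size(blocks.tobytes())                     # ZX: data blocks
            + compressed_size(rows.astype(np.int32).tobytes())    # ZL: row latents
            + compressed_size(cols.astype(np.int32).tobytes()))   # ZL: column latents
```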
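A usage sketch of the selection and averaging protocol follows: \|Lr\| and \|Lc\| are chosen by minimizing the compressed size, KMeans is run 5 times with random initializations, and the DRR is averaged over the runs. The candidate grid, the input path, and the DRR definition used here (raw size divided by compressed size) are assumptions; the paper does not specify them in the quoted text.

```python
# Hypothetical driver for the protocol above; grid, path, and DRR
# definition (raw/compressed) are assumptions.
table = np.loadtxt("taxicab.csv", delimiter=",", dtype=np.float32)  # placeholder path
raw = table.nbytes

drrs = []
for seed in range(5):  # 5 KMeans runs with random initializations
    best = min(latent_compressed_size(table, nr, nc, seed=seed)
               for nr in (2, 4, 8, 16)   # candidate |Lr| values
               for nc in (2, 4, 8))      # candidate |Lc| values
    drrs.append(raw / best)              # DRR for this run (assumed definition)

print(f"average DRR over 5 runs: {np.mean(drrs):.2f}")
```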