Estimating the Rate-Distortion Function by Wasserstein Gradient Descent

Authors: Yibo Yang, Stephan Eckstein, Marcel Nutz, Stephan Mandt

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort.
Researcher Affiliation | Academia | University of California, Irvine; ETH Zurich; Columbia University. {yibo.yang, mandt}@uci.edu, stephan.eckstein@math.ethz.ch, mnutz@columbia.edu
Pseudocode | Yes | Algorithm 1: Wasserstein gradient descent (a hedged sketch of such a particle update follows the table).
Open Source Code | Yes | Our code can be found at https://github.com/yiboyang/wgd.
Open Datasets | Yes | We perform R(D) estimation on higher-dimensional data, including the physics and speech datasets from [Yang and Mandt, 2022] and MNIST [LeCun et al., 1998].
Dataset Splits | No | The paper mentions using 'training data' and 'training samples' (e.g., 'm = 100000 samples' and 'MNIST training set') but does not explicitly provide percentages or counts for train/validation/test splits, or refer to predefined splits with citations for reproducibility.
Hardware Specification | Yes | Our deconvolution experiments were run on Intel(R) Xeon(R) CPUs, and the rest of the experiments were run on Titan RTX GPUs.
Software Dependencies | No | The paper mentions using the 'Jax library [Bradbury et al., 2018]' but does not provide a specific version number for Jax or any other key software components.
Experiment Setup | Yes | We use the Adam optimizer for all gradient-based methods, except we use simple gradient descent with a decaying step size in Sec. 5.1 to better compare the convergence speed of WGD and its hybrid variant. For the RD-VAE, we used a similar architecture as the one used on the banana-shaped source in [Yang and Mandt, 2022], consisting of two-layer MLPs for the encoder and decoder networks, and Masked Autoregressive Flow [Papamakarios et al., 2017] for the variational prior. For NERD, we follow similar architecture settings as [Lei et al., 2023a], using a two-layer MLP for the decoder network. We use a two-layer network for NERD and RD-VAE with some hand-tuning (we replace the softplus activation in the original RD-VAE network by ReLU as it led to difficulty in optimization). [...] We use n = 20 particles for BA, NERD, WGD and its hybrid variant. [...] The input to all the algorithms is an empirical measure µ_m (training distribution) with m = 100000 samples, given the same fixed seed. (An assumed Adam-based particle update sketch also follows the table.)
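
The pseudocode row above refers to Algorithm 1 (Wasserstein gradient descent), which evolves a particle approximation of the reproduction distribution. The following is a minimal, hedged JAX sketch of such a particle update under an assumed squared-error distortion; the specific loss form, the step-size scaling by the particle count, and all names here are illustrative assumptions rather than the authors' implementation.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def rate_loss(particles, x, lam):
    """Empirical rate functional under an assumed squared-error distortion.

    particles: (n, d) locations of the equal-weight atoms of the reproduction measure.
    x:         (m, d) source samples defining the empirical source distribution.
    lam:       Lagrange multiplier trading off rate against distortion.
    """
    # Pairwise squared distances between source samples and particles, shape (m, n).
    sq_dist = jnp.sum((x[:, None, :] - particles[None, :, :]) ** 2, axis=-1)
    n = particles.shape[0]
    # Average over x of -log( (1/n) * sum_i exp(-lam * ||x - y_i||^2) ).
    return jnp.mean(-(logsumexp(-lam * sq_dist, axis=1) - jnp.log(n)))


@jax.jit
def wgd_step(particles, x, lam, step_size):
    # For equal-weight particles, a Wasserstein gradient step moves each atom along
    # the negative Euclidean gradient of the loss at its location. Multiplying by n
    # compensates for the 1/n particle weight inside the mixture; this scaling
    # convention is an assumption of the sketch.
    grads = jax.grad(rate_loss)(particles, x, lam)
    return particles - step_size * particles.shape[0] * grads
```

In use, one would initialize the particles (for example from training samples) and iterate `wgd_step` over the empirical source measure, with a decaying step size when mimicking the Sec. 5.1 comparison.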
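
The experiment-setup row also states that Adam is used for all gradient-based methods. Building on the `rate_loss` function from the sketch above, the snippet below shows one plausible way to drive the particle updates with Adam via optax; the use of optax, the learning rate, the step count, and the synthetic data are assumptions, since the paper only confirms Jax and the Adam optimizer.

```python
import jax
import jax.numpy as jnp
import optax  # assumed optimizer library; the paper confirms Jax and Adam, not optax itself

# Synthetic placeholder source data; the paper uses m = 100000 samples of real sources.
x_train = jax.random.normal(jax.random.PRNGKey(0), (1000, 2))

# n = 20 matches the quoted setup; lam, learning rate, and step count are placeholders.
n, lam, learning_rate, num_steps = 20, 1.0, 1e-2, 500
particles = x_train[:n]          # initialize the reproduction atoms from data
opt = optax.adam(learning_rate)
opt_state = opt.init(particles)


@jax.jit
def adam_step(particles, opt_state, x, lam):
    grads = jax.grad(rate_loss)(particles, x, lam)  # rate_loss as defined in the sketch above
    updates, opt_state = opt.update(grads, opt_state)
    return optax.apply_updates(particles, updates), opt_state


for _ in range(num_steps):
    particles, opt_state = adam_step(particles, opt_state, x_train, lam)
```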