Estimating the Rate-Distortion Function by Wasserstein Gradient Descent
Authors: Yibo Yang, Stephan Eckstein, Marcel Nutz, Stephan Mandt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. |
| Researcher Affiliation | Academia | 1 University of California, Irvine; 2 ETH Zurich; 3 Columbia University. {yibo.yang, mandt}@uci.edu, stephan.eckstein@math.ethz.ch, mnutz@columbia.edu |
| Pseudocode | Yes | Algorithm 1 Wasserstein gradient descent (a hedged particle-update sketch follows this table) |
| Open Source Code | Yes | Our code can be found at https://github.com/yiboyang/wgd. |
| Open Datasets | Yes | We perform R(D) estimation on higher-dimensional data, including the physics and speech datasets from [Yang and Mandt, 2022] and MNIST [LeCun et al., 1998]. |
| Dataset Splits | No | The paper mentions using 'training data' and 'training samples' (e.g., 'm = 100000 samples' and 'MNIST training set') but does not explicitly provide percentages or counts for train/validation/test splits, or refer to predefined splits with citations for reproducibility. |
| Hardware Specification | Yes | Our deconvolution experiments were run on Intel(R) Xeon(R) CPUs, and the rest of the experiments were run on Titan RTX GPUs. |
| Software Dependencies | No | The paper mentions using the 'Jax library [Bradbury et al., 2018]' but does not provide a specific version number for Jax or any other key software components. |
| Experiment Setup | Yes | We use the Adam optimizer for all gradient-based methods, except we use simple gradient descent with a decaying step size in Sec. 5.1 to better compare the convergence speed of WGD and its hybrid variant. For the RD-VAE, we used a similar architecture as the one used on the banana-shaped source in [Yang and Mandt, 2022], consisting of two-layer MLPs for the encoder and decoder networks, and Masked Autoregressive Flow [Papamakarios et al., 2017] for the variational prior. For NERD, we follow similar architecture settings as [Lei et al., 2023a], using a two-layer MLP for the decoder network. We use a two-layer network for NERD and RD-VAE with some hand-tuning (we replace the softplus activation in the original RD-VAE network by ReLU as it led to difficulty in optimization). [...] We use n = 20 particles for BA, NERD, WGD and its hybrid variant. [...] The input to all the algorithms is an empirical measure µm (training distribution) with m = 100000 samples, given the same fixed seed. (Hedged WGD and Adam-update sketches follow this table.) |
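
The pseudocode row above refers to the paper's Algorithm 1 (Wasserstein gradient descent). Below is a minimal JAX sketch of one plausible reading of that particle update for a squared-error distortion, where an equal-weight empirical reproduction measure is moved along the gradient of a Blahut–Arimoto-style rate functional. The function names, the distortion choice, and all hyperparameters are illustrative assumptions; they are not taken from the paper or from https://github.com/yiboyang/wgd.

```python
# Hedged sketch of a Wasserstein-gradient-descent particle update for R(D)
# estimation. Assumes squared-error distortion and equal-weight particles;
# the paper's exact objective and step-size convention may differ.
import jax
import jax.numpy as jnp

def rate_functional(particles, x_batch, lam):
    """E_x[-log (1/n) sum_j exp(-lam * d(x, y_j))] with d = squared error."""
    d = jnp.sum((x_batch[:, None, :] - particles[None, :, :]) ** 2, axis=-1)  # (m, n) distortions
    n = particles.shape[0]
    return -jax.scipy.special.logsumexp(-lam * d, axis=1, b=1.0 / n).mean()

@jax.jit
def wgd_step(particles, x_batch, lam, step_size):
    # Forward-Euler step on the particle locations; the 1/n particle weight
    # is treated as absorbed into the step size.
    grads = jax.grad(rate_functional)(particles, x_batch, lam)
    return particles - step_size * grads

# Toy usage with n = 20 particles (the count reported in the setup row) on a
# 2-D Gaussian stand-in source; the source itself is an assumption.
x = jax.random.normal(jax.random.PRNGKey(0), (4096, 2))
y = 0.1 * jax.random.normal(jax.random.PRNGKey(1), (20, 2))
for _ in range(500):
    y = wgd_step(y, x, lam=5.0, step_size=1e-2)
```

Under these assumptions, because each particle carries a fixed weight 1/n, the Wasserstein gradient step reduces to ordinary gradient descent on the particle locations, which is why the Adam-driven variant sketched next is a drop-in change.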
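
The setup row states that Adam is used for all gradient-based methods except the simple gradient descent comparison in Sec. 5.1. Continuing the sketch above (reusing rate_functional, x, and y), one way to drive the same particle update with Adam is via optax; the library choice and the learning rate are assumptions, not details taken from the paper or its code.

```python
# Hypothetical Adam-driven variant of the particle update sketched above;
# optax and the learning rate are assumptions, not from the paper or its code.
import optax

optimizer = optax.adam(learning_rate=1e-2)
opt_state = optimizer.init(y)

for _ in range(500):
    grads = jax.grad(rate_functional)(y, x, 5.0)
    updates, opt_state = optimizer.update(grads, opt_state, y)
    y = optax.apply_updates(y, updates)
```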