Learning multi-scale local conditional probability models of images

Authors: Zahra Kadkhodaie, Florentin Guth, Stéphane Mallat, Eero P. Simoncelli

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test this model on a dataset of face images, which are highly non-stationary and contain large-scale geometric structures. Remarkably, denoising, super-resolution, and image synthesis results all demonstrate that these structures can be captured with significantly smaller conditioning neighborhoods than required by a Markov model implemented in the pixel domain. Our results show that score estimation for large complex images can be reduced to low-dimensional Markov conditional models across scales, alleviating the curse of dimensionality. Using a coarse-to-fine anti-diffusion strategy for drawing samples from the posterior (Kadkhodaie & Simoncelli, 2021), we evaluate the model on denoising, super-resolution, and synthesis, and show that locality and stationarity assumptions hold for conditional RF sizes as small as 9x9 without harming performance. We now evaluate our Markov wavelet conditional model on a denoising task. We use the CelebA dataset (Liu et al., 2015) at 160x160 resolution. Figure 3 shows that the multi-scale denoiser based on a conditional wavelet Markov model outperforms a conventional denoiser that implements a Markov probability model in the pixel domain.
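The coarse-to-fine anti-diffusion sampler quoted above (Kadkhodaie & Simoncelli, 2021) can be sketched as follows. This is a simplified illustration, not the paper's implementation: it follows the denoiser residual f(y) - y, which is proportional to the score of the noisy density, while shrinking the injected noise. The default constants mirror the h, β, σ0, σ values reported in the experiment-setup row, and the stopping rule on the residual magnitude is our reading of Algorithm 1.

```python
import numpy as np

def sample_from_denoiser(denoiser, shape, h=0.01, beta=0.1,
                         sigma_0=1.0, sigma_final=0.01,
                         max_iter=100_000, rng=None):
    """Stochastic ascent of the log-likelihood gradient from a denoiser
    residual (a sketch of Algorithm 1, Kadkhodaie & Simoncelli, 2021)."""
    rng = np.random.default_rng() if rng is None else rng
    y = sigma_0 * rng.standard_normal(shape)    # start from pure noise
    n = y.size
    for t in range(max_iter):
        d = denoiser(y) - y                     # residual ~ sigma^2 * score
        sigma2 = np.sum(d ** 2) / n             # effective noise variance
        if np.sqrt(sigma2) <= sigma_final:      # converged to the data manifold
            return y, t
        # injected-noise amplitude that keeps the sigma schedule contracting
        gamma2 = ((1 - beta * h) ** 2 - (1 - h) ** 2) * sigma2
        y = y + h * d + np.sqrt(max(gamma2, 0.0)) * rng.standard_normal(shape)
    return y, max_iter
```

As a toy check, plugging in the (hypothetical) optimal denoiser for a prior concentrated at zero, `lambda y: np.zeros_like(y)`, drives the sample toward the zero image until the effective sigma falls below `sigma_final`.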
Researcher Affiliation | Academia | Zahra Kadkhodaie (CDS, New York University; zk388@nyu.edu); Florentin Guth (DI, ENS, CNRS, PSL University; florentin.guth@ens.fr); Stéphane Mallat (Collège de France; Flatiron Institute, Simons Foundation; stephane.mallat@ens.fr); Eero P. Simoncelli (CNS, Courant, and CDS, New York University; Flatiron Institute, Simons Foundation; eero.simoncelli@nyu.edu)
Pseudocode | Yes | Algorithm 1: Sampling via ascent of the log-likelihood gradient from a denoiser residual; Algorithm 2: Wavelet Conditional Synthesis
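Algorithm 2 (Wavelet Conditional Synthesis) can be sketched with an orthonormal Haar transform and placeholder conditional samplers. This is our illustrative reconstruction, not the paper's code: the paper draws the wavelet detail bands from the learned conditional (cCNN) score at each scale, whereas `sample_details` here is a stand-in callback.

```python
import numpy as np

def inverse_haar(low, details):
    """Invert one level of an orthonormal 2D Haar transform.
    `low` is the coarse band; `details` holds the horizontal, vertical,
    and diagonal bands, each the same shape as `low`."""
    lh, hl, hh = details
    n, m = low.shape
    out = np.empty((2 * n, 2 * m))
    out[0::2, 0::2] = (low + lh + hl + hh) / 2
    out[0::2, 1::2] = (low - lh + hl - hh) / 2
    out[1::2, 0::2] = (low + lh - hl - hh) / 2
    out[1::2, 1::2] = (low - lh - hl + hh) / 2
    return out

def wavelet_conditional_synthesis(sample_coarse, sample_details,
                                  coarse_shape, n_scales):
    """Coarse-to-fine synthesis in the spirit of Algorithm 2: draw the
    coarsest band, then at each scale draw wavelet details conditioned
    on the current coarse image and invert the transform."""
    x = sample_coarse(coarse_shape)
    for _ in range(n_scales):
        w = sample_details(x)  # in the paper: a draw from the learned cCNN
        x = inverse_haar(x, w)
    return x
```

With placeholder samplers (a constant coarse band and zero details), two synthesis levels turn a 4x4 seed into a 16x16 image, doubling the resolution at each scale.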
Open Source Code | Yes | A software implementation is available at https://github.com/LabForComputationalVision/local-probability-models-of-images
Open Datasets | Yes | We use the CelebA dataset (Liu et al., 2015) at 160x160 resolution. Train and test images are from the CelebA HQ dataset (Karras et al., 2018) and of size 320x320.
Dataset Splits | Yes | For experiments shown in Figure 3 and Figure 4, we use 202,499 training and 100 test images of resolution 160x160 from the CelebA dataset (Liu et al., 2015). For experiments shown in Figure 5, Figure 7 and Figure 6, we use 29,900 train and 100 test images, drawn from the CelebA HQ dataset (Karras et al., 2018) at 320x320 resolution.
Hardware Specification | No | The paper mentions 'computing resources of the Flatiron Institute' in the acknowledgments but does not specify any particular GPU/CPU models or other hardware details used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | All networks contain 21 convolutional layers with no subsampling, each consisting of 64 channels. Each layer, except for the first and the last, is followed by a ReLU non-linearity and bias-free batch-normalization. All convolutional kernels in the low-pass CNN are of size 3x3, resulting in a 43x43 RF size and 665,856 parameters in total. Convolutional kernels in the cCNNs are adjusted to achieve different RF sizes. For example, a 13x13 RF arises from choosing 3x3 kernels in every 4th layer and 1x1 (i.e., pointwise linear combinations across all channels) for the rest, resulting in a total of 214,144 parameters. We follow the training procedure described in (Mohan et al., 2020), minimizing the mean squared error in denoising images corrupted by i.i.d. Gaussian noise with standard deviations drawn from the range [0, 1] (relative to image intensity range [0, 1]). Training is carried out on batches of size 512. For the examples in Figure 5, Figure 7 and Figure 6, we chose h = 0.01, σ0 = 1, β = 0.1 and σ = 0.01.
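The receptive-field arithmetic in the setup above is easy to verify: for stacked stride-1 convolutions the RF grows by k - 1 per layer with kernel size k, so 21 layers of 3x3 kernels give 1 + 21·2 = 43. The sketch below checks this and one way of reading "3x3 kernels in every 4th layer" (six 3x3 layers at positions 1, 5, 9, ..., 21, the rest 1x1) that yields the stated 13x13 RF; the exact layer placement is our assumption.

```python
def receptive_field(kernel_sizes):
    """RF of a stack of stride-1 convolutions: 1 + sum of (k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)

# Low-pass CNN: 21 layers, all 3x3 -> 43x43 RF, as stated in the paper.
assert receptive_field([3] * 21) == 43

# cCNN with a 13x13 RF: six 3x3 layers among 21 (every 4th, counting
# layer 1), 1x1 pointwise kernels elsewhere -- placement is assumed.
ccnn = [3 if i % 4 == 0 else 1 for i in range(21)]
assert receptive_field(ccnn) == 13
```

Because 1x1 layers add nothing to the RF, the designer can trade depth (model capacity) against locality independently, which is what lets the cCNNs keep 21 layers while shrinking the conditioning neighborhood.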