Contrastive Divergence Learning is a Time Reversal Adversarial Game
Authors: Omer Yair, Tomer Michaeli
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports experiments on a synthetic toy problem: "To illustrate our observations, we now conclude with a simple toy example (see Fig. 3)." Figure 3 caption: "A toy example illustrating the importance of the adversarial nature of CD. Here, the data lies close to a 2D spiral embedded in a 10-dimensional space. (a) The training samples in the first 3 dimensions. (b) Three different approaches for learning the distribution: CNCE with large contrastive variance (top), CNCE with small contrastive variance (middle), and CD based on Langevin dynamics MCMC with the weight adjustment described in Sec. 3.4 (bottom). As can be seen in the first two columns, CD adapts the contrastive samples according to the data distribution, whereas CNCE does not. Therefore, CNCE with large variance fails to learn the distribution because the vast majority of its contrastive samples are far from the manifold and quickly become irrelevant (as indicated by the weights αθ in the third column), and CNCE with small variance fails to learn the global structure of the distribution because its contrastive samples are extremely close to the dataset samples. CD, on the other hand, adjusts the contrastive distribution during training, so as to generate samples that are close to the manifold yet traverse large distances along it." Figure 4 caption: "Here, we use different CD configurations for learning the model of Fig. 3. All configurations use Langevin dynamics as their MCMC process, but with different ways of compensating for the lack of detailed balance. From left to right we have the ground-truth density, CD w/o any correction, CD with Metropolis-Hastings rejection, and CD with our proposed adjustment." A minimal, unadjusted Langevin dynamics sketch appears below the table. |
| Researcher Affiliation | Academia | Omer Yair, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (omeryair@gmail.com); Tomer Michaeli, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (tomer.m@ee.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1: Contrastive Divergence k; Algorithm 2: Adjusted Contrastive Divergence k. A hedged sketch of the standard CD-k objective appears below the table. |
| Open Source Code | Yes | The paper states: "The code for reproducing the results is available at (for the blind review the code will be available in the supplementary material)." The URL itself is not given in the reviewed version. |
| Open Datasets | No | The paper describes generating a synthetic dataset: "We take data concentrated around a 2-dimensional manifold embedded in 10-dimensional space. Specifically, let e(1), . . . , e(10) denote the standard basis in R10. Then each data sample is generated by adding Gaussian noise to a random point along a 2D spiral lying in the e(1)-e(2) plane." No public dataset is cited or linked. A hedged sketch of this construction appears below the table. |
| Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits (e.g., percentages or sample counts). It mentions "training samples" but no detailed partitioning. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using a "multi-layer perceptron (MLP)" but does not provide specific software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries with their versions). |
| Experiment Setup | Yes | As a parametric model for log pθ(x), we used an 8-layer multi-layer perceptron (MLP) of width 512 with skip connections, as illustrated in Fig. 5. For the MCMC process we used 5 steps of Langevin dynamics... We found 0.0075 to be the step size (multiplying the standard Gaussian noise term) which produces the best results. The optimization of all configurations was performed using SGD with a momentum of 0.9 and an exponentially decaying learning rate. Except for the training of the third configuration, the learning rate decayed from 10^-2 to 10^-4 over 100,000 optimization steps. For the third configuration we had to reduce the learning rate by a factor of 10 in order to prevent the optimization from diverging. In order to select the best step size / variance for each configuration, we ran a parameter sweep around the relevant value range; the results of this sweep are shown in Fig. 6. A hedged PyTorch sketch of this setup appears below the table. |
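
The Open Datasets row quotes the construction of the synthetic data: Gaussian noise added to random points along a 2D spiral lying in the e(1)-e(2) plane of R^10. The following is a minimal sketch of such a construction; the spiral's number of turns, its radius profile, the noise standard deviation, and the sample count are not specified in the quoted text and are illustrative assumptions here.

```python
import numpy as np

def make_spiral_data(n_samples=10_000, dim=10, noise_std=0.05, turns=2.0, seed=0):
    """Toy data near a 2D spiral in the e1-e2 plane of R^dim, plus Gaussian noise.

    The spiral shape (turns, linearly growing radius), noise_std, and n_samples
    are NOT given in the paper; they are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=n_samples)       # position along the spiral
    angle = 2.0 * np.pi * turns * t
    radius = t                                      # assumed linear radius growth
    x = np.zeros((n_samples, dim))
    x[:, 0] = radius * np.cos(angle)                # spiral lives in the e1-e2 plane
    x[:, 1] = radius * np.sin(angle)
    x += noise_std * rng.normal(size=x.shape)       # Gaussian noise in all dimensions
    return x
```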
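
The Experiment Setup row specifies an 8-layer MLP of width 512 with skip connections modeling log pθ(x), trained with SGD (momentum 0.9) and a learning rate decaying exponentially from 10^-2 to 10^-4 over 100,000 steps. The sketch below is one plausible PyTorch rendering of those numbers; the exact skip-connection pattern of the paper's Fig. 5 and the activation function are assumptions.

```python
import torch
import torch.nn as nn

class SkipMLP(nn.Module):
    """8 linear layers of width 512 with additive skip connections, outputting a
    scalar log p_theta(x). The precise skip pattern (paper Fig. 5) and the ReLU
    activation are assumptions."""

    def __init__(self, in_dim=10, width=512, depth=8):
        super().__init__()
        self.inp = nn.Linear(in_dim, width)
        self.hidden = nn.ModuleList([nn.Linear(width, width) for _ in range(depth - 2)])
        self.out = nn.Linear(width, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.inp(x))
        for layer in self.hidden:
            h = h + self.act(layer(h))              # residual / skip connection
        return self.out(h).squeeze(-1)              # scalar log-density per sample

model = SkipMLP()
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# Exponential decay from 1e-2 to 1e-4 over 100,000 steps: gamma**100_000 == 1e-2.
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=(1e-2) ** (1 / 100_000))
```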
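
The same row reports 5 Langevin dynamics steps per CD iteration, with a step size of 0.0075 multiplying the standard Gaussian noise term. A minimal unadjusted Langevin sampler consistent with that description could look as follows; the drift coefficient step_size^2 / 2 follows the usual Langevin convention and is an assumption, and the detailed-balance adjustment of the paper's Sec. 3.4 (as well as Metropolis-Hastings rejection) is not reproduced here.

```python
import torch

def langevin_sample(log_p, x0, n_steps=5, step_size=0.0075):
    """A few steps of unadjusted Langevin dynamics starting from x0.

    log_p: callable returning the unnormalized log-density log p_theta(x).
    step_size multiplies the Gaussian noise term, as in the quoted setup; the
    step_size**2 / 2 drift coefficient is the standard convention (assumption).
    """
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(log_p(x).sum(), x)[0]
        x = x + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()
```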
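
Finally, the Pseudocode row lists Algorithm 1 (Contrastive Divergence k) and Algorithm 2 (Adjusted Contrastive Divergence k). The snippet below sketches only the standard, unadjusted CD-k objective, i.e., the usual energy-difference surrogate whose gradient matches the CD update; the sample-weight adjustment of Algorithm 2 is not reproduced here.

```python
import torch

def cd_k_loss(f_theta, x_data, mcmc_sample, k=5):
    """Vanilla CD-k surrogate loss (not the adjusted variant of Algorithm 2).

    f_theta: network giving the unnormalized log-density log p_theta(x).
    mcmc_sample: callable running k MCMC steps initialized at x_data, e.g.
    lambda x, k: langevin_sample(f_theta, x, n_steps=k).
    """
    x_neg = mcmc_sample(x_data, k).detach()         # contrastive (negative) samples
    # Minimizing this pushes f_theta up on the data and down on the negatives,
    # which matches the CD gradient estimate of the log-likelihood.
    return f_theta(x_neg).mean() - f_theta(x_data).mean()
```

Tying the sketches together, one would draw minibatches from make_spiral_data, wrap them in torch tensors, obtain contrastive samples with langevin_sample (using the model as log_p), and take optimizer steps on cd_k_loss while stepping the exponential scheduler.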