Contrastive Divergence Learning is a Time Reversal Adversarial Game
Authors: Omer Yair, Tomer Michaeli
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper reports experiments on a synthetic toy problem: "To illustrate our observations, we now conclude with a simple toy example (see Fig. 3)." Figure 3 caption: "A toy example illustrating the importance of the adversarial nature of CD. Here, the data lies close to a 2D spiral embedded in a 10-dimensional space. (a) The training samples in the first 3 dimensions. (b) Three different approaches for learning the distribution: CNCE with large contrastive variance (top), CNCE with small contrastive variance (middle), and CD based on Langevin dynamics MCMC with the weight adjustment described in Sec. 3.4 (bottom). As can be seen in the first two columns, CD adapts the contrastive samples according to the data distribution, whereas CNCE does not. Therefore, CNCE with large variance fails to learn the distribution because the vast majority of its contrastive samples are far from the manifold and quickly become irrelevant (as indicated by the weights αθ in the third column), and CNCE with small variance fails to learn the global structure of the distribution because its contrastive samples are extremely close to the dataset samples. CD, on the other hand, adjusts the contrastive distribution during training, so as to generate samples that are close to the manifold yet traverse large distances along it." Figure 4 caption: "Here, we use different CD configurations for learning the model of Fig. 3. All configurations use Langevin dynamics as their MCMC process, but with different ways of compensating for the lack of detailed balance. From left to right we have the ground-truth density, CD w/o any correction, CD with Metropolis-Hastings rejection, and CD with our proposed adjustment." A minimal, unadjusted Langevin dynamics sketch appears below the table. |
| Researcher Affiliation | Academia | Omer Yair, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (omeryair@gmail.com); Tomer Michaeli, Department of Electrical Engineering, Technion - Israel Institute of Technology, Haifa, Israel (tomer.m@ee.technion.ac.il) |
| Pseudocode | Yes | Algorithm 1: Contrastive Divergence k; Algorithm 2: Adjusted Contrastive Divergence k. A hedged sketch of the standard CD-k objective appears below the table. |
| Open Source Code | Yes | The paper states: "The code for reproducing the results is available at (for the blind review the code will be available in the supplementary material)." The URL itself is not given in the reviewed version. |
| Open Datasets | No | The paper describes generating a synthetic dataset: "We take data concentrated around a 2-dimensional manifold embedded in 10-dimensional space. Specifically, let e(1), . . . , e(10) denote the standard basis in R10. Then each data sample is generated by adding Gaussian noise to a random point along a 2D spiral lying in the e(1)-e(2) plane." No public dataset is cited or linked. A hedged sketch of this construction appears below the table. |
| Dataset Splits | No | The paper does not provide explicit training, validation, or test dataset splits (e.g., percentages or sample counts). It mentions "training samples" but no detailed partitioning. |
| Hardware Specification | No | The paper does not specify any hardware used for running the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions using a "multi-layer perceptron (MLP)" but does not provide specific software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific libraries with their versions). |
| Experiment Setup | Yes | As a parametric model for log pθ(x), we used an 8-layer multi-layer perceptron (MLP) of width 512 with skip connections, as illustrated in Fig. 5. For the MCMC process we used 5 steps of Langevin dynamics... We found 0.0075 to be the step size (multiplying the standard Gaussian noise term) which produces the best results. The optimization of all configurations was performed using SGD with a momentum of 0.9 and an exponentially decaying learning rate. Except for the training of the third configuration, the learning rate decayed from 10^-2 to 10^-4 over 100,000 optimization steps. For the third configuration we had to reduce the learning rate by a factor of 10 in order to prevent the optimization from diverging. In order to select the best step size / variance for each configuration, we ran a parameter sweep around the relevant value range; the results of this sweep are shown in Fig. 6. A hedged PyTorch sketch of this setup appears below the table. |
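
The Open Datasets row quotes the construction of the synthetic data: Gaussian noise added to random points along a 2D spiral lying in the e(1)-e(2) plane of R^10. The following is a minimal sketch of such a construction; the spiral's number of turns, its radius profile, the noise standard deviation, and the sample count are not specified in the quoted text and are illustrative assumptions here.

```python
import numpy as np

def make_spiral_data(n_samples=10_000, dim=10, noise_std=0.05, turns=2.0, seed=0):
    """Toy data near a 2D spiral in the e1-e2 plane of R^dim, plus Gaussian noise.

    The spiral shape (turns, linearly growing radius), noise_std, and n_samples
    are NOT given in the paper; they are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, size=n_samples)       # position along the spiral
    angle = 2.0 * np.pi * turns * t
    radius = t                                      # assumed linear radius growth
    x = np.zeros((n_samples, dim))
    x[:, 0] = radius * np.cos(angle)                # spiral lives in the e1-e2 plane
    x[:, 1] = radius * np.sin(angle)
    x += noise_std * rng.normal(size=x.shape)       # Gaussian noise in all dimensions
    return x
```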
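
The Experiment Setup row specifies an 8-layer MLP of width 512 with skip connections modeling log pθ(x), trained with SGD (momentum 0.9) and a learning rate decaying exponentially from 10^-2 to 10^-4 over 100,000 steps. The sketch below is one plausible PyTorch rendering of those numbers; the exact skip-connection pattern of the paper's Fig. 5 and the activation function are assumptions.

```python
import torch
import torch.nn as nn

class SkipMLP(nn.Module):
    """8 linear layers of width 512 with additive skip connections, outputting a
    scalar log p_theta(x). The precise skip pattern (paper Fig. 5) and the ReLU
    activation are assumptions."""

    def __init__(self, in_dim=10, width=512, depth=8):
        super().__init__()
        self.inp = nn.Linear(in_dim, width)
        self.hidden = nn.ModuleList([nn.Linear(width, width) for _ in range(depth - 2)])
        self.out = nn.Linear(width, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        h = self.act(self.inp(x))
        for layer in self.hidden:
            h = h + self.act(layer(h))              # residual / skip connection
        return self.out(h).squeeze(-1)              # scalar log-density per sample

model = SkipMLP()
opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# Exponential decay from 1e-2 to 1e-4 over 100,000 steps: gamma**100_000 == 1e-2.
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=(1e-2) ** (1 / 100_000))
```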
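
The same row reports 5 Langevin dynamics steps per CD iteration, with a step size of 0.0075 multiplying the standard Gaussian noise term. A minimal unadjusted Langevin sampler consistent with that description could look as follows; the drift coefficient step_size^2 / 2 follows the usual Langevin convention and is an assumption, and the detailed-balance adjustment of the paper's Sec. 3.4 (as well as Metropolis-Hastings rejection) is not reproduced here.

```python
import torch

def langevin_sample(log_p, x0, n_steps=5, step_size=0.0075):
    """A few steps of unadjusted Langevin dynamics starting from x0.

    log_p: callable returning the unnormalized log-density log p_theta(x).
    step_size multiplies the Gaussian noise term, as in the quoted setup; the
    step_size**2 / 2 drift coefficient is the standard convention (assumption).
    """
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(log_p(x).sum(), x)[0]
        x = x + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()
```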
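
Finally, the Pseudocode row lists Algorithm 1 (Contrastive Divergence k) and Algorithm 2 (Adjusted Contrastive Divergence k). The snippet below sketches only the standard, unadjusted CD-k objective, i.e., the usual energy-difference surrogate whose gradient matches the CD update; the sample-weight adjustment of Algorithm 2 is not reproduced here.

```python
import torch

def cd_k_loss(f_theta, x_data, mcmc_sample, k=5):
    """Vanilla CD-k surrogate loss (not the adjusted variant of Algorithm 2).

    f_theta: network giving the unnormalized log-density log p_theta(x).
    mcmc_sample: callable running k MCMC steps initialized at x_data, e.g.
    lambda x, k: langevin_sample(f_theta, x, n_steps=k).
    """
    x_neg = mcmc_sample(x_data, k).detach()         # contrastive (negative) samples
    # Minimizing this pushes f_theta up on the data and down on the negatives,
    # which matches the CD gradient estimate of the log-likelihood.
    return f_theta(x_neg).mean() - f_theta(x_data).mean()
```

Tying the sketches together, one would draw minibatches from make_spiral_data, wrap them in torch tensors, obtain contrastive samples with langevin_sample (using the model as log_p), and take optimizer steps on cd_k_loss while stepping the exponential scheduler.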