Pitfalls of Gaussians as a noise distribution in NCE

Authors: Holden Lee, Chirag Pabbaraju, Anish Prasad Sevekari, Andrej Risteski

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also verify our results with simulations. Precisely, we study the MSE for the empirical NCE loss as a function of the ambient dimension, and recover the dependence from Theorem 4. For dimension d ∈ {70, 72, ..., 120}, we generate n = 500 samples from the distribution P we construct in the theorem. We generate an equal number of samples from the noise distribution Q = N(0, I_d), and run gradient descent to minimize the empirical NCE loss to obtain θ̂_n. Since we explicitly know what θ∗ is, we can compute the squared error ∥θ̂_n − θ∗∥². We run 100 trials of this, obtaining fresh samples from P and Q each time, and average the squared errors over the trials to obtain an estimate of the MSE. Figure 1 shows the plot of log MSE versus dimension; the graph is nearly linear. This corroborates the bound in Theorem 4, which tells us that as n → ∞, the MSE scales exponentially with d. (A hedged simulation sketch is given after this table.)
Researcher Affiliation | Academia | Holden Lee (Johns Hopkins University, hlee283@jhu.edu); Chirag Pabbaraju (Stanford University, cpabbara@cs.stanford.edu); Anish Sevekari (Carnegie Mellon University, asevekar@andrew.cmu.edu); Andrej Risteski (Carnegie Mellon University, aristesk@andrew.cmu.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide a statement or link indicating that source code for the methodology is openly available.
Open Datasets | No | The paper states: "For dimension d ∈ {70, 72, ..., 120}, we generate n = 500 samples from the distribution P we construct in the theorem. We generate an equal number of samples from the noise distribution Q = N(0, I_d)..." This indicates the data was generated internally from a theoretical construction, not drawn from a publicly accessible dataset with a link or citation.
Dataset Splits | No | The paper mentions generating samples for simulations but does not specify any training/validation/test splits, nor does it refer to predefined splits with citations.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models, memory, or the computing environment used for the simulations.
Software Dependencies | No | The paper mentions running "gradient descent" but does not specify any software libraries or version numbers used for the implementation (e.g., PyTorch, TensorFlow, scikit-learn).
Experiment Setup | No | The paper states: "For dimension d ∈ {70, 72, ..., 120}, we generate n = 500 samples from the distribution P we construct in the theorem. We generate an equal number of samples from the noise distribution Q = N(0, I_d), and run gradient descent to minimize the empirical NCE loss to obtain θ̂_n." While this specifies the number of samples and the dimension range, it lacks specific hyperparameters for gradient descent (e.g., learning rate, number of steps, initialization) and other detailed training configurations necessary for full reproducibility.
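
Because the quoted setup omits the construction of P and θ∗ as well as the gradient-descent hyperparameters, the following is only a minimal sketch of the simulation pipeline described in the Research Type row above, not the authors' implementation. It substitutes a normalized Gaussian location family p_θ = N(θ, I_d) as a hypothetical stand-in for P (this toy family will not reproduce the exponential blow-up, which hinges on the specific distribution from Theorem 4), and the learning rate, step count, initialization, and θ∗ are arbitrary placeholder choices.

```python
# Hedged sketch of the simulation loop: minimize the empirical NCE loss by
# gradient descent, then average the squared parameter error over trials.
# ASSUMPTIONS: p_theta = N(theta, I_d) is a placeholder model (the paper's P
# and theta* are different); lr, steps, and theta_star are arbitrary.
import numpy as np

def nce_loss_and_grad(theta, x_data, x_noise):
    """Empirical NCE loss (equal numbers of data and noise samples) and its
    gradient for the toy model p_theta = N(theta, I_d), noise q = N(0, I_d)."""
    def logit(x):
        # log p_theta(x) - log q(x) = theta.x - ||theta||^2 / 2 (normalizers cancel)
        return x @ theta - 0.5 * (theta @ theta)

    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    s_data, s_noise = logit(x_data), logit(x_noise)
    n = len(x_data)
    loss = -(np.log(sig(s_data) + 1e-12).sum()
             + np.log(1.0 - sig(s_noise) + 1e-12).sum()) / (2 * n)
    # d logit / d theta = x - theta
    g_data = -((1.0 - sig(s_data))[:, None] * (x_data - theta)).sum(axis=0)
    g_noise = (sig(s_noise)[:, None] * (x_noise - theta)).sum(axis=0)
    return loss, (g_data + g_noise) / (2 * n)

def estimate_mse(d, theta_star, n=500, trials=100, lr=0.1, steps=2000, seed=0):
    """Average ||theta_hat_n - theta_star||^2 over independent trials."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        x_data = rng.normal(theta_star, 1.0, size=(n, d))   # samples from the stand-in P
        x_noise = rng.normal(0.0, 1.0, size=(n, d))          # samples from Q = N(0, I_d)
        theta = np.zeros(d)
        for _ in range(steps):                                # plain gradient descent
            _, grad = nce_loss_and_grad(theta, x_data, x_noise)
            theta -= lr * grad
        errs.append(np.sum((theta - theta_star) ** 2))
    return np.mean(errs)
```

As a usage sketch, calling estimate_mse(d, theta_star=np.ones(d)/np.sqrt(d)) for d in range(70, 121, 2) and plotting np.log of the results against d yields a log-MSE-versus-dimension curve in the same format as Figure 1, though the growth with d for this toy family will not match the paper's construction. The 1/(2n) normalization corresponds to the standard NCE classification loss with equal numbers of data and noise samples.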