Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
A Unified View on Learning Unnormalized Distributions via Noise-Contrastive Estimation
Authors: Jongha Jon Ryu, Abhin Shah, Gregory W. Wornell
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation. To demonstrate this behavior, we considered a simple synthetic setup, where the data-generating distribution is N(µ, 1) with µ = 1. With a conditional noise distribution π(y|x) = N(y|x, ϵ^2 I) and varying ϵ, we plot the derivatives of the empirical objective of the original CNCE for varying K ∈ {1, 4, 16, 64}, where the sample size is N = 10^4. As shown in Figure 1, the empirical derivatives characterize the mean fairly closely when ϵ ≳ 10^−2, or when ϵ is small and K is large. This simple 1D Gaussian example clearly shows the undesirable behavior of the CNCE objective when ϵ is small. A more in-depth study of the effect of ϵ and K for high-dimensional problems is left as future work. (...) G. Experiments. In this section, we present a preliminary empirical evaluation of a selected set of estimators on synthetic data, following the setting of (Shah et al., 2023, Section 5.1). We consider an unnormalized exponential-family model ϕ_θ(x) ∝ exp(x^T θ x), where θ ∈ R^(p×p), for x ∈ [−1, 1]^p. The data-generating distribution is the model with θ = θ*, where (θ*)_ij = 1/p if i = 1, j = 1, or i = j, and 0 otherwise. The samples were generated by brute-force sampling, discretizing each axis into 100 bins. We generated N = 10^5 samples for p ∈ {11, 13, 15, 17, 19} and computed the estimates for each estimator with varying sample size n ∈ {0.04N, 0.08N, ..., 0.64N}. We repeated the experiments with random subsamples 5 times for each configuration. Assuming the parameter space Θ is bounded under the Frobenius norm, we consider NCE estimators regularized by the Frobenius norm and optimized via gradient descent. We used a regularization weight λ_n = 10^−2 and a learning rate η = 0.1 across all settings, except for the f_log-NCE estimator, where we used η = 1.0. Each optimization was run for 1000 gradient steps. As shown in Figure 2, the selected estimators exhibit an empirical convergence rate of n^(−1/2). However, we observed that the f_1-NCE estimator (asymmetric log NCE; see Table 1) and the CNCE estimator did not display convergent behavior, despite the theoretical guarantees available for this example. This discrepancy highlights the need for further investigation into the empirical behavior of various estimators, particularly in high-dimensional settings. |
| Researcher Affiliation | Academia | 1Department of EECS, MIT, Cambridge, Massachusetts, USA. Correspondence to: J. Jon Ryu <EMAIL>. |
| Pseudocode | No | The paper describes methods and theoretical analyses using mathematical formulations and proofs, but does not include any clearly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper describes a preliminary simulation result in Appendix G but does not provide any explicit statement about releasing the source code, a repository link, or mention of code in supplementary materials. |
| Open Datasets | No | To demonstrate this behavior, we considered a simple synthetic setup, where the data-generating distribution is N(µ, 1) with µ = 1. (...) In this section, we present a preliminary empirical evaluation of a selected set of estimators on synthetic data, following the setting of (Shah et al., 2023, Section 5.1). We consider an unnormalized exponential-family model ϕ_θ(x) ∝ exp(x^T θ x), where θ ∈ R^(p×p), for x ∈ [−1, 1]^p. The data-generating distribution is the model with θ = θ*, where (θ*)_ij = 1/p if i = 1, j = 1, or i = j, and 0 otherwise. The samples were generated by brute-force sampling, discretizing each axis into 100 bins. |
| Dataset Splits | No | We generated N = 10^5 samples for p ∈ {11, 13, 15, 17, 19} and computed the estimates for each estimator with varying sample size n ∈ {0.04N, 0.08N, ..., 0.64N}. We repeated the experiments with random subsamples 5 times for each configuration. This describes varying sample sizes and subsampling for evaluation, but not explicit training/validation/test splits needed to reproduce a model. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its simulations or experiments. |
| Software Dependencies | No | The paper mentions optimization parameters like regularization weight and learning rate but does not specify any software dependencies (e.g., library or solver names with version numbers). |
| Experiment Setup | Yes | We used a regularization weight λ_n = 10^−2 and a learning rate η = 0.1 across all settings, except for the f_log-NCE estimator, where we used η = 1.0. Each optimization was run for 1000 gradient steps. |
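The quoted 1D Gaussian experiment (Figure 1 of the paper) can be sketched roughly as follows. The data distribution N(µ, 1) with µ = 1, the conditional noise π(y|x) = N(y|x, ϵ^2 I), and the values of N and K come from the quote; the exact CNCE objective used here (a logistic loss on the log-density ratio), the function name `cnce_grad`, and the probe points µ = 0.5 and µ = 1.5 are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 1.0
N = 10_000  # N = 10^4 as in the quoted setup

x = rng.normal(mu_true, 1.0, size=N)

def cnce_grad(mu, x, eps, K, rng):
    """Derivative w.r.t. mu of an empirical CNCE-style objective
    with symmetric conditional noise pi(y|x) = N(y | x, eps^2)."""
    y = x[:, None] + eps * rng.standard_normal((x.size, K))
    # log-ratio G(x, y) = log phi_mu(x) - log phi_mu(y),
    # with unnormalized model phi_mu(x) = exp(-(x - mu)^2 / 2)
    G = -(x[:, None] - mu) ** 2 / 2 + (y - mu) ** 2 / 2
    # d/dmu of log(1 + exp(-G)) is -sigmoid(-G) * dG/dmu
    dG = (x[:, None] - mu) - (y - mu)   # equals x - y
    sig = 1.0 / (1.0 + np.exp(G))       # sigmoid(-G)
    return np.mean(-sig * dG)

# The empirical derivative should change sign around mu = mu_true
# when eps is not too small; for tiny eps it nearly vanishes everywhere,
# which is the degenerate behavior the quoted passage describes.
for eps in (1.0, 1e-1, 1e-2):
    g_lo = cnce_grad(0.5, x, eps, K=16, rng=rng)
    g_hi = cnce_grad(1.5, x, eps, K=16, rng=rng)
    print(f"eps={eps:g}: grad at mu=0.5 -> {g_lo:+.5f}, at mu=1.5 -> {g_hi:+.5f}")
```

Note that the gradient scales roughly with ϵ^2, so for ϵ = 10^−2 it is easily swamped by Monte Carlo noise unless K is large, matching the quoted observation.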
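The Appendix G setup (brute-force grid sampling from ϕ_θ*(x) ∝ exp(x^T θ* x), then a Frobenius-regularized NCE estimator run by gradient descent with λ = 10^−2, η = 0.1, and 1000 steps) can be sketched at a reduced scale. The reduced dimensions (p = 3 and 20 bins instead of p ≥ 11 and 100 bins), the sample size, the uniform noise distribution, and the plain log-NCE loss with a free log-partition scalar c are all illustrative assumptions; the paper compares several NCE variants whose exact objectives differ.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
p, bins = 3, 20  # reduced scale so the brute-force grid stays small

# theta*: 1/p on the first row, first column, and diagonal; 0 elsewhere
theta_star = np.zeros((p, p))
theta_star[0, :] = theta_star[:, 0] = 1 / p
np.fill_diagonal(theta_star, 1 / p)

# brute-force sampling from phi(x) ~ exp(x^T theta* x) on a grid over [-1, 1]^p
grid1d = np.linspace(-1, 1, bins)
grid = np.array(list(product(grid1d, repeat=p)))           # (bins^p, p)
logits = np.einsum('ni,ij,nj->n', grid, theta_star, grid)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
n = 4000
x_data = grid[rng.choice(grid.shape[0], size=n, p=probs)]

# log-NCE with uniform noise on [-1, 1]^p; log-partition is a free scalar c
x_noise = rng.uniform(-1, 1, size=(n, p))
log_q = -p * np.log(2.0)  # log density of the uniform noise

theta = np.zeros((p, p))
c = 0.0
lam, eta, steps = 1e-2, 0.1, 1000  # values from the quoted setup

def log_ratio(x, theta, c):
    """Classifier logit r(x) = log phi_theta(x) + c - log q(x)."""
    return np.einsum('ni,ij,nj->n', x, theta, x) + c - log_q

for _ in range(steps):
    rd = log_ratio(x_data, theta, c)
    rn = log_ratio(x_noise, theta, c)
    wd = 1 / (1 + np.exp(rd))    # sigmoid(-r) on data
    wn = 1 / (1 + np.exp(-rn))   # sigmoid(r) on noise
    gd = np.einsum('n,ni,nj->ij', wd, x_data, x_data) / n
    gn = np.einsum('n,ni,nj->ij', wn, x_noise, x_noise) / n
    theta -= eta * (-gd + gn + 2 * lam * theta)  # Frobenius regularization
    c -= eta * (-wd.mean() + wn.mean())

err = np.linalg.norm(theta - theta_star)
print(f"||theta_hat - theta*||_F = {err:.3f}")
```

The logistic NCE loss is convex in (θ, c) because the logit r is linear in them, so plain gradient descent suffices here; the recovered θ should land well within the Frobenius distance ||θ*||_F = sqrt(7)/3 of the truth, up to discretization, regularization, and sampling error.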