Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
When Models Don’t Collapse: On the Consistency of Iterative MLE
Authors: Daniel Barzilai, Ohad Shamir
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | There are by now many experiments in the literature that support our finding from Theorem 4.1 [Alemohammad et al., 2024, Gerstgrasser et al., 2025, Dey and Donoho, 2024]. In particular, in those papers, the error does not increase much from iteration to iteration when synthetic data is added gradually. Rather than repeating experiments, we analyze the difference between exact MLE solutions and those that are obtained via optimization. To this end, we pick several families whose MLE has a known closed form. These include a Gaussian (where the parameters are the mean and std) in Fig. 1, Exponential distribution in Fig. 2, and a family of Beta distributions with PDFs given by $p(x; \theta) = \theta x^{\theta - 1}$ for $\theta > 0$ and $x \in (0, 1)$ in Fig. 3. The real parameters are $\theta_0 = (\mu = 0, \sigma = 1)$ for the Gaussian and $\theta_0 = 1$ for the other distributions. When optimizing numerically for the MLE, we use scipy.optimize.minimize on the negative log likelihood to find the parameters. We opt for this built-in function to remove any uncertainty regarding the quality of the optimization code itself. We take the number of samples to be one of 20, 50, or 100. We run the iterative MLE algorithm as specified in the paper for up to $T = 100$. All values are averaged over 50 runs. The error is measured by the norm relative to the real parameters, meaning $\lVert \theta^{(t)} - \theta_0 \rVert$. |
| Researcher Affiliation | Academia | Daniel Barzilai Weizmann Institute of Science EMAIL Ohad Shamir Weizmann Institute of Science EMAIL |
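| Pseudocode | Yes | Algorithm 1: Iterative Maximum Likelihood Estimation. Require: parameter space $\Theta \subseteq \mathbb{R}^d$; family of distributions $\{p_\theta\}_{\theta \in \Theta}$ over input space $\mathcal{X}$; number of samples per iteration $n$; target parameters $\theta_0 \in \Theta$. 1: Set $\theta^{(0)} := \theta_0$. 2: for $T = 0, 1, 2, \ldots$ do 3: Sample $X^{(T)} := \{x^{(T)}_1, \ldots, x^{(T)}_n\} \sim p_{\theta^{(T)}}(\cdot)$ i.i.d. 4: Define the cumulative dataset $X^{(\le T)} := \bigcup_{t=0}^{T} X^{(t)}$. 5: Train the model on $X^{(\le T)}$: $\theta^{(T+1)} := \arg\min_{\theta \in \Theta} \sum_{t=0}^{T} \ell_t(\theta)$, where $\ell_t(\theta) := -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x^{(t)}_i)$. |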
| Open Source Code | No | Answer: [NA] Justification: [NA] |
| Open Datasets | No | The paper describes simulating data from Gaussian, Exponential, and Beta distributions, not using pre-existing public datasets. The NeurIPS Paper Checklist also indicates no open access to data. "Answer: [NA] Justification: [NA]" |
| Dataset Splits | No | The paper describes simulations by sampling from distributions, not using pre-existing datasets that would require explicit training/test/validation splits. |
| Hardware Specification | No | The paper does not provide any specific hardware details used for running its experiments. The NeurIPS Paper Checklist confirms this. "Answer: [NA] Justification: [NA]" |
| Software Dependencies | No | The paper mentions using "scipy.optimize.minimize" but does not provide specific version numbers for scipy or any other software dependencies needed for reproducibility. |
| Experiment Setup | Yes | We take the number of samples to be one of 20, 50, or 100. We run the iterative MLE algorithm as specified in the paper for up to $T = 100$. All values are averaged over 50 runs. The error is measured by the norm relative to the real parameters, meaning $\lVert \theta^{(t)} - \theta_0 \rVert$. In all cases, the error at all timesteps is similar to the error at time 1, as our theory would suggest. Furthermore, we observe the model (non)-collapse behavior between the exact MLE and the optimized one to be similar. We also consider various $\theta_0$ going from 0.1 to 1 for the Beta distribution, where a smaller $\theta_0$ corresponds to a neighborhood of the parameters that are less smooth. In all cases, we plot the ratio between the error at time $T$ to the error at time 1. |
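
The iterative MLE loop and the error-ratio measurement quoted in the table can be illustrated with a minimal sketch. This is not the authors' code (none is released). It assumes an Exponential family parameterized by its rate, uses the closed-form MLE (one over the sample mean) rather than `scipy.optimize.minimize`, and averages errors over runs before taking the ratio. The function names `exponential_mle`, `iterative_mle`, and `mean_error_ratio` are hypothetical.

```python
import random
import statistics


def exponential_mle(samples):
    """Closed-form MLE of an Exponential rate: 1 / sample mean."""
    return 1.0 / statistics.fmean(samples)


def iterative_mle(theta0, n, num_iters, rng):
    """Sketch of the iterative MLE loop with a cumulative dataset.

    Returns the estimates [theta^(1), ..., theta^(num_iters)].
    Assumption: the Exponential family is parameterized by its rate.
    """
    theta = theta0           # theta^(0) is the real parameter
    cumulative = []          # all samples drawn so far, X^(<=T)
    estimates = []
    for _ in range(num_iters):
        # Draw n i.i.d. samples from the current model p_{theta^(T)}.
        cumulative.extend(rng.expovariate(theta) for _ in range(n))
        # Every round contributes n equally weighted samples, so minimizing
        # the summed per-round losses equals the MLE on the pooled data.
        theta = exponential_mle(cumulative)
        estimates.append(theta)
    return estimates


def mean_error_ratio(theta0=1.0, n=50, num_iters=100, runs=50, seed=0):
    """Ratio of the run-averaged error at the last step to that at step 1.

    Assumption: errors are averaged over runs before taking the ratio;
    the source does not say which order is used.
    """
    rng = random.Random(seed)
    err_first = 0.0
    err_last = 0.0
    for _ in range(runs):
        estimates = iterative_mle(theta0, n, num_iters, rng)
        err_first += abs(estimates[0] - theta0)
        err_last += abs(estimates[-1] - theta0)
    return (err_last / runs) / (err_first / runs)


if __name__ == "__main__":
    for n in (20, 50, 100):
        print(n, mean_error_ratio(n=n))
```

Only the first batch is drawn from the real parameter; every later batch is synthetic, drawn from the most recent estimate. Swapping `exponential_mle` for a numerical optimizer on the negative log likelihood would give the "optimized" variant that the quoted setup compares against.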