Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Rethinking Losses for Diffusion Bridge Samplers
Authors: Sebastian Sanokowski, Lukas Gruber, Christoph Bartmann, Sepp Hochreiter, Sebastian Lehner
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments. [Flattened excerpt of the results table for GMM-40 (50d) and MoS-10 (50d); metrics per task: Sinkhorn, ELBO, EMC, MMD; Ground Truth row: 875.21 86.023 0. 1. 0.07 0.001 (GMM-40) and 449.06 104.87 0. 1. 0.07 0.001 (MoS-10) ...] Table 1: Results on Bayesian learning benchmarks. The ELBO (the higher the better) is reported for various methods and tasks. Table 2: Results on GMM-40 (50d) and MoS-10 (50d). Table 3: Results on Funnel (10d) and Many Well (5d). Our experimental setup mirrors Chen et al. [2024a], i.e. training all methods for 40.000 training iterations with a batch size of 2.000 on all tasks besides LGCP with a batch size of 300. Models are trained using 128 diffusion steps. |
| Researcher Affiliation | Collaboration | 1 Technical University Munich, Chair of Robotics, Artificial Intelligence and Embedded Systems; 2 ELLIS Unit Linz, LIT AI Lab, Johannes Kepler University Linz, Austria; 3 NXAI Lab & NXAI GmbH, Linz, Austria |
| Pseudocode | Yes | D.4 Pseudoalgorithms In the following we provide pseudoalgorithms for the computation of the LV loss, r KL-LD loss and the r KL-R Loss: Algorithm 1 Computation of the LV Loss 1: Given: Batch of diffusion paths X0 T qα,ν computed with the Euler-Maruyama integration (Eq. 4) 2: X 0 T stop grad(X0 T ) detach the gradient from the Euler-Maruyama integration 3: compute log qα,ν(X 0 T ) with Eq. 5 4: compute log pϕ,ν(X 0 T ) with Eq. 6 5: compute Loss L(α,ϕ,ν) = V ar (log qα,ν(X 0 T ) pϕ,ν(X 0 T )) 6: Backpropagate through the loss and update (α,ϕ,ν) using Adam optimizer Algorithm 2 Computation of the r KL-R Loss 1: Given: Batch of diffusion paths X0 T qα,ν computed with the Euler-Maruyama integration (Eq. 4) 2: compute log qα,ν(X0 T ) with Eq. 5 3: compute log pϕ,ν(X0 T ) with Eq. 6 4: compute Loss L(α,ϕ,ν) = mean[log qα,ν(X0 T ) pϕ,ν(X0 T )] 5: Backpropagate through the loss and update (α,ϕ,ν) using Adam optimizer Algorithm 3 Computation of the r KL-LD Loss: Averages are always computed over the batch dimension 1: Given: Batch of diffusion paths X0 T qα,ν(X0 T ) computed with the Euler-Maruyama integration (Eq. 4) 2: X 0 T stop grad(X0 T ) 3: compute log qα,ν(X 0 T ) with Eq. 5 4: compute log pϕ,ν(X 0 T ) with Eq. 6 5: compute control variate bqα,ν α,ϕ,ν = mean[log qα,ν(X 0 T ) pϕ,ν(X 0 T )] 6: compute advantages A = stop gradient[log qα,ν(X 0 T ) pϕ,ν(X 0 T ) bqα,ν α,ϕ,ν 1] 7: compute Loss L(α,ϕ,ν) = mean[A log qα,ν(X 0 T )] mean[log pϕ,ν(X 0 T )] 8: Backpropagate through the loss and update (α,ϕ,ν) using Adam optimizer |
| Open Source Code | Yes | Our code is available at https://github.com/sanokows/Rethinking Lossesfor Diffusion Bridge Samplers. |
| Open Datasets | Yes | 5 Experiments Benchmarks: Following Chen et al. [2024a], we evaluate our model on two types of tasks: In Tab. 1 we evaluate on Bayesian learning problems, where we report the ELBO due to the absence of data samples (see App. C). In Tab. 2 and Tab. 3 we evaluate Synthetic targets, where we report the Sinkhorn distance, Entropic Mode Coverage (EMC) Blessing et al. [2024] and Maximum Mean Discrepancy (MMD), which are all based on samples from the diffusion sampler and samples from the target distribution (see App. C). On multimodal tasks, a combination of high ELBO and EMC values and low Sinkhorn and MMD distances indicate good performance. Detailed descriptions of all problem types are provided in App. C.1. C.1.1 Bayesian Learning tasks: These tasks involve probabilistic inference where the true underlying parameter distributions are unknown, requiring Bayesian approaches for estimation. Bayesian Logistic Regression (Sonar and Credit): We consider Bayesian logistic regression for binary classification on two well-established benchmark datasets, frequently used for evaluating variational inference and Markov Chain Monte Carlo (MCMC) methods. Random Effect Regression (Seeds): The Seeds dataset (d = 26) is modeled using a hierarchical random effects regression framework, which captures both fixed and random effects to account for variability across different experimental conditions. Time Series Models (Brownian): The Brownian motion model (d = 32) represents a discretized stochastic process commonly used in time series analysis, with Gaussian observation noise. Spatial Statistics (LGCP): The Log-Gaussian Cox Process (LGCP) is a widely used spatial model in statistics [Møller et al., 1998], which describes spatially distributed point processes such as the locations of tree saplings. C.1.2 Synthetic targets: For these tasks, ground-truth samples are available, allowing for direct evaluation of inference accuracy. Mixture distributions (GMM and MoS): We consider mixture models where the target distribution consists of m mixture components, defined as: ptarget = 1 Funnel: The Funnel distribution, originally introduced in Neal [2003], serves as a challenging benchmark due to its highly anisotropic shape. |
| Dataset Splits | No | The paper uses well-established benchmark datasets and synthetic targets, implying standard setups. However, it does not explicitly provide specific train/test/validation split percentages, absolute sample counts for splits, or citations to predefined splits within the paper. The evaluation section (D.1) mentions samples used for evaluation metrics, but not for dataset splitting. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It mentions training iterations and batch sizes, but no GPU models, CPU models, memory specifications, or cloud/cluster configurations are detailed. |
| Software Dependencies | No | The paper mentions several software components, such as 'RAdam Liu et al. [2020] optimizer', 'ott package Cuturi et al. [2022]', and 'code from Chen et al. [2024b]'. However, it does not specify version numbers for any of these libraries, optimizers, or the programming language used. |
| Experiment Setup | Yes | Our experimental setup mirrors Chen et al. [2024a], i.e. training all methods for 40.000 training iterations with a batch size of 2.000 on all tasks besides LGCP with a batch size of 300. Models are trained using 128 diffusion steps. D.2 Hyperparameter tuning: In benchmark experiments in Sec. 5 we perform for each method a grid search over σdiff, σprior, and the learning rate of the model. The learning rate of the diffusion parameters such as σprior and σdiff is always chosen to be equal to the model learning rate. On all Bayesian learning tasks, we perform for CMCD and DBS a grid search over σdiff,init = {0.1,0.3}, σprior,init = {0.5,1.0} and the learning rate λmodel,SDE ∈ {0.005,0.002,0.001}. On Brownian and German Credit, we found that if σdiff is not learned a finer grid-search over σdiff is necessary. Therefore on Brownian, we additionally add σdiff = 0.05 and on German Credit σdiff = 0.01 to the grid search. On MoS 50D and GMM 50D, we follow Chen et al. [2024a] and fix σprior,init to a high initial value. We found that σprior,init = 80 yielded the best results. We found that small model learning rates and compared to that large learning rates of the interpolation parameters between the prior and the target distribution work well. Therefore we adapt the grid search to λinterpol = {0.01,0.001} and the learning rate λmodel,SDE ∈ {0.0001,0.00005,0.00001} for CMCD. For DBS, the interpolation between the prior and the target distribution is not learned. Therefore, we did not additionally search over λinterpol but increased the size of the grid search by searching over σprior,init = {60,80}. On Many Well we conduct grid search over σdiff,init = {0.05,0.1,0.2}, σprior,init = {0.5,1.0,2.0}, λmodel,SDE ∈ {0.001,0.0001,0.00001}. D.3 Architecture. Score parametrization: The learned score with parameters θ is parameterized in the following way: $s_\theta(X_t) = \mathrm{clip}\big(s_\theta(X_t,t) + \hat{s}_\theta(t)\,\mathrm{clip}(\nabla_{X_t}\log \pi_t(X_t),\,-10^2,\,10^2),\,-10^4,\,10^4\big)$ (20), where $s_\theta(X_t,t)$ and $\hat{s}_\theta(t)$ are two separate neural networks which are parameterized with the PISgradnet architecture from Vargas et al. [2024] with 64 hidden neurons and 2 layers. Parametrization diffusion coefficient: We keep diffusion coefficients constant across time steps and parameterize it as $\sigma_t = \exp(\gamma)$, where $\gamma = \log \sigma_{\mathrm{init}}$. In principle, one could parameterize it similarly as the interpolation parameters, which would allow for a time-adaptive schedule. However, we leave this for future work. Training: All parameters are trained with the usage of the RAdam Liu et al. [2020] optimizer. We use gradient clipping by norm at the value of 1. The learning rate decays with a cosine learning rate schedule from λstart to λstart/10. |
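
The training details quoted in the Experiment Setup row (grid search, gradient clipping by norm at 1, cosine decay from λstart to λstart/10) can be illustrated with a small standard-library sketch. It is not taken from the authors' repository: the function names `cosine_lr`, `clip_by_norm` and `make_grid` are hypothetical, and the exact half-cosine form of the schedule is an assumption, since the quoted text only gives its start and end values.

```python
import itertools
import math


def cosine_lr(step, total_steps, lr_start, final_ratio=0.1):
    """Half-cosine decay from lr_start to lr_start * final_ratio.

    Assumption: the quoted text only says the rate decays with a cosine
    schedule from lambda_start to lambda_start / 10; this specific form is
    one common choice, not necessarily the one used in the paper.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    lr_end = lr_start * final_ratio
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))


def clip_by_norm(grad, max_norm=1.0):
    """Rescale a flat gradient (list of floats) so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def make_grid(**axes):
    """Enumerate every combination of the given hyperparameter values."""
    names = list(axes)
    return [dict(zip(names, values)) for values in itertools.product(*axes.values())]


# Grid quoted for CMCD and DBS on the Bayesian learning tasks.
# The key names are hypothetical labels for sigma_diff,init, sigma_prior,init
# and lambda_model,SDE.
BAYESIAN_GRID = make_grid(
    sigma_diff_init=[0.1, 0.3],
    sigma_prior_init=[0.5, 1.0],
    lr_model=[0.005, 0.002, 0.001],
)

TOTAL_STEPS = 40_000  # "40.000 training iterations" in the quoted setup

if __name__ == "__main__":
    print(len(BAYESIAN_GRID), "configurations")
    for config in BAYESIAN_GRID[:2]:
        lr_mid = cosine_lr(TOTAL_STEPS // 2, TOTAL_STEPS, config["lr_model"])
        print(config, "learning rate at midpoint:", lr_mid)
```

Under these assumptions the Bayesian-task grid has 2 × 2 × 3 = 12 configurations per method, and each run ends at one tenth of its starting learning rate.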