From MAP to Marginals: Variational Inference in Bayesian Submodular Models
Authors: Josip Djolonga, Andreas Krause
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide the first general treatment of probabilistic inference with log-submodular and log-supermodular distributions, that can capture high-order variable interactions. We develop L-FIELD, a novel variational inference scheme that optimizes over sub- and supergradients of submodular functions. Our scheme yields both upper and lower bounds on the partition function, which imply rigorous probability intervals for marginals. We can also obtain factorial approximations of the distribution at no larger computational cost than performing MAP inference in the model (for which a plethora of algorithms are available). We identify a natural link between our scheme and the well-known mean-field method. We establish theoretical guarantees about the accuracy of our bounds, dependent on the curvature of the underlying submodular function. We demonstrate the accuracy of L-FIELD on several Bayesian submodular models. Our experiments aim to address four main questions: (1) How large is the gap between the upper- and lower-bounds for the log-partition function and the marginals? (2) How accurate are the factorized approximations obtained from a single MAP-like optimization problem? (3) How does the accuracy depend on the amount of evidence (i.e., concentration of the posterior), the curvature of the function, and the type of Bayesian submodular model considered? (4) How does L-FIELD compare to mean-field on problems where the latter can be applied? |
| Researcher Affiliation | Academia | Josip Djolonga, Department of Computer Science, ETH Zürich, josipd@inf.ethz.ch; Andreas Krause, Department of Computer Science, ETH Zürich, krausea@ethz.ch |
| Pseudocode | Yes | 1: procedure FRANK-WOLFE(F, x₁, ϵ) 2: Define f(x) = log(1 + eˣ) ▷ Elementwise. 3: for k = 1, 2, …, T do 4: Pick s ∈ argmin_{x∈B(F)} ⟨x, ∇f(xₖ)⟩ 5: if ⟨xₖ − s, ∇f(xₖ)⟩ ≤ ϵ then 6: return xₖ ▷ Small duality gap. 7: else 8: xₖ₊₁ = (1 − γₖ)xₖ + γₖ s; γₖ = 2/(k+2) — 1: procedure DIVIDE-CONQUER(F) 2: s ← (F(V)/\|V\|)·1; A* ← minimizer of F(·) − s(·) 3: if F(A*) = s(A*) then 4: return s 5: else 6: s_{A*} ← DIVIDE-CONQUER(F^{A*}) 7: s_{V∖A*} ← DIVIDE-CONQUER(F_{A*}) 8: return (s_{A*}, s_{V∖A*}) |
| Open Source Code | Yes | The code will be made available at http://las.ethz.ch. |
| Open Datasets | Yes | Motivated by applications in semi-supervised learning, we sampled data from a 2-dimensional Gaussian mixture model with 2 clusters. We sampled random graphs with 100 nodes from a Watts-Strogatz model. We also have a set I of contamination scenarios. For each i ∈ I and j ∈ V we have a utility Ci,j ∈ [0, 1] that comes from real data [35]. |
| Dataset Splits | No | The paper does not explicitly describe standard training/validation/test dataset splits with percentages or sample counts. It discusses conditional distributions and how they are used in experiments, but not formal data splits for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as CPU/GPU models, memory, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using 'the min-cut implementation from [34]' but does not provide specific version numbers for this or any other software dependencies. |
| Experiment Setup | Yes | Motivated by applications in semi-supervised learning, we sampled data from a 2-dimensional Gaussian mixture model with 2 clusters. The centers were sampled from N([3, 3], I) and N([−3, 3], I) respectively. For each cluster, we sampled n = 50 points from a bivariate normal. These 2n points were then used as nodes to create a graph with weight between points x and x′ equal to e^(−‖x−x′‖²). As prior we chose P(A) ∝ exp(−F(A)), where F is the cut function in this graph; hence P(A) is a regular MRF. We used a log-supermodular prior P(A) ∝ exp(Σ_{v∈V} (\|N_v ∩ A\| / \|N_v\|)^µ), where µ ∈ [0, 1] and N_v is the union of v and its neighbors. We used the following sampling strategy: we consider nodes v ∈ V in some order, then sample a Bernoulli Z with probability P(Z = 1) = q_v based on the factorized distribution q from the modular upper bound, and condition on v ∈ S if Z = 1, or v ∉ S if Z = 0. |
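The FRANK-WOLFE procedure in the pseudocode row can be sketched as follows. This is a minimal illustration, not the authors' code: it minimizes f(x) = Σᵢ log(1 + e^(xᵢ)) over the base polytope B(F), using Edmonds' greedy algorithm as the linear minimization oracle; the helper names `greedy_lmo` and `frank_wolfe` are illustrative.

```python
import math

def greedy_lmo(F, ground_set, grad):
    """Linear minimization oracle over the base polytope B(F):
    Edmonds' greedy algorithm, visiting elements in ascending gradient order
    and assigning each its marginal gain F(prefix + v) - F(prefix)."""
    order = sorted(ground_set, key=lambda v: grad[v])
    s, prefix = {}, []
    prev = F(prefix)
    for v in order:
        prefix.append(v)
        cur = F(prefix)
        s[v] = cur - prev
        prev = cur
    return s

def frank_wolfe(F, ground_set, x0, eps=1e-6, T=1000):
    """Minimize f(x) = sum_i log(1 + e^{x_i}) over B(F), stopping when the
    Frank-Wolfe duality gap drops below eps. x0 should lie in B(F)."""
    x = dict(x0)
    for k in range(T):
        # Gradient of f: the elementwise sigmoid.
        grad = {v: 1.0 / (1.0 + math.exp(-x[v])) for v in ground_set}
        s = greedy_lmo(F, ground_set, grad)
        gap = sum((x[v] - s[v]) * grad[v] for v in ground_set)
        if gap <= eps:
            return x  # Small duality gap.
        gamma = 2.0 / (k + 3)  # gamma_k = 2/(k+2), with k starting at 1
        x = {v: (1 - gamma) * x[v] + gamma * s[v] for v in ground_set}
    return x
```

For example, with F(A) = min(|A|, 1) on a two-element ground set, B(F) is the segment between (1, 0) and (0, 1), and the iterates converge to the symmetric point (0.5, 0.5).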
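The data-generation step of the experiment setup (two Gaussian clusters, an exp(−‖x−x′‖²)-weighted similarity graph, and a cut-function prior) can be sketched roughly as below. This is an illustrative reconstruction, not the paper's released code; `sample_mixture_data`, `similarity_weights`, and `cut_value` are hypothetical helper names.

```python
import math
import random

def sample_mixture_data(n=50, seed=0):
    """Two cluster centers drawn from N([3,3], I) and N([-3,3], I), then
    n points per cluster from a unit-covariance bivariate normal."""
    rng = random.Random(seed)
    centers = [(rng.gauss(3, 1), rng.gauss(3, 1)),
               (rng.gauss(-3, 1), rng.gauss(3, 1))]
    return [(rng.gauss(cx, 1), rng.gauss(cy, 1))
            for cx, cy in centers for _ in range(n)]

def similarity_weights(points):
    """Edge weight between points x and x' is exp(-||x - x'||^2)."""
    w = {}
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            w[(i, j)] = math.exp(-d2)
    return w

def cut_value(w, A):
    """F(A): total weight of edges crossing the cut (A, V \\ A); the
    log-supermodular prior is then P(A) proportional to exp(-F(A))."""
    in_a = set(A)
    return sum(wij for (i, j), wij in w.items() if (i in in_a) != (j in in_a))
```

With n = 50 this yields the 2n = 100 graph nodes described in the setup; MAP inference in the resulting prior reduces to a minimum-cut computation.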