On Divergence Measures for Training GFlowNets

Authors: Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Then, we empirically show that the ineffectiveness of divergence-based learning of GFlowNets is due to the large gradient variance of the corresponding stochastic objectives. To address this issue, we devise a collection of provably variance-reducing control variates for gradient estimation based on the REINFORCE leave-one-out estimator. Our experimental results suggest that the resulting algorithms often accelerate training convergence when compared against previous approaches. All in all, our work contributes by narrowing the gap between GFlowNet training and HVI, paving the way for algorithmic advancements inspired by the divergence minimization viewpoint. (A minimal leave-one-out control-variate sketch appears after the table.)
Researcher Affiliation | Academia | Tiago da Silva, Eliezer de Souza da Silva, Diego Mesquita; {tiago.henrique, eliezer.silva, diego.mesquita}@fgv.br; School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil
Pseudocode | No | The paper describes methods and processes but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code | No | Regarding open access to the code, we will make the code public upon acceptance.
Open Datasets | Yes | We consider widely adopted benchmark tasks from GFlowNet literature, described in Section 5.1, contemplating both discrete and continuous target distributions. Please refer to Appendix B and Appendix E for additional information on the experimental setup. Section 5.1 and Appendix B detail tasks such as 'Set generation [3, 34, 65, 66]', 'Autoregressive sequence generation [32, 55]', 'Bayesian phylogenetic inference (BPI) [111]', 'Hypergrid navigation [3, 55, 56, 66]', 'Bayesian structure learning [15, 16]', 'Mixture of Gaussians (GMs) [48, 110]', and 'Banana-shaped distribution [57, 76]', all with citations to their respective established sources.
Dataset Splits | No | The paper describes its evaluation protocols and metrics for assessing the learned distributions against target distributions (e.g., L1 distance, Jensen-Shannon divergence), but it does not specify explicit train/validation/test dataset splits with percentages or sample counts in the traditional sense for supervised learning tasks. The datasets for generative tasks are often implicitly defined by the target distribution or generated trajectories. (An illustrative computation of these metrics appears after the table.)
Hardware Specification | No | The paper describes the neural network architectures (e.g., MLPs, GIN) and training configurations but does not provide specific details about the hardware used, such as GPU or CPU models, or memory specifications.
Software Dependencies | No | The paper mentions software frameworks like 'JAX [8]' and 'PyTorch [69]' and optimizers like 'Adam [41]' but does not specify the version numbers for these software components, which is necessary for a reproducible description.
Experiment Setup | Yes | For every generative task, we used the Adam optimizer [41] to carry out the stochastic optimization, employing a learning rate of 10^-1 for log Z_θ when minimizing L_TB and 10^-3 for the remaining parameters, following previous works [48, 55, 56, 66]. We polynomially annealed the learning rate towards 0 along training, similarly to [86]. Also, we use Leaky ReLU [98] as the non-linear activation function of all implemented neural networks. Appendix E provides further details on batch sizes, number of epochs, and specific model architectures for each task. (A hedged sketch of this configuration follows the table.)
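
The Research Type row quotes the paper's use of REINFORCE leave-one-out (RLOO) control variates to reduce gradient variance. Below is a minimal PyTorch sketch of a leave-one-out baseline for a score-function gradient estimator; the function name rloo_surrogate, the tensor shapes, and the objective values are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a REINFORCE leave-one-out (RLOO) control variate for a
    # score-function gradient estimator. Names and shapes are illustrative
    # assumptions, not the authors' code.
    import torch

    def rloo_surrogate(log_probs: torch.Tensor, objective_values: torch.Tensor) -> torch.Tensor:
        """log_probs: (K,) log-probabilities of K sampled trajectories under the model.
        objective_values: (K,) per-sample values f(x_k) whose expectation is optimized.
        Returns a surrogate loss whose gradient equals the RLOO REINFORCE estimator."""
        k = objective_values.shape[0]
        # Leave-one-out baseline: for each sample, the mean of the other K-1 values.
        baseline = (objective_values.sum() - objective_values) / (k - 1)
        advantages = (objective_values - baseline).detach()  # control variate; no gradient flows through it
        # Differentiating (advantage * log_prob).mean() yields
        # (1/K) * sum_k (f(x_k) - b_k) * grad log p(x_k).
        return (advantages * log_probs).mean()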
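
The Dataset Splits row mentions L1 distance and Jensen-Shannon divergence as evaluation metrics against target distributions. The sketch below shows one way to compute both for probability vectors on a finite support; the function names and the assumption of categorical (finite) supports are ours, not the paper's evaluation code.

    # Illustrative computation of the evaluation metrics named above for two
    # probability vectors over the same finite state space (assumed setting).
    import numpy as np

    def l1_distance(p: np.ndarray, q: np.ndarray) -> float:
        # Sum of absolute differences between learned and target probabilities.
        return float(np.abs(p - q).sum())

    def jensen_shannon(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
        # JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
        m = 0.5 * (p + q)
        kl_pm = float(np.sum(p * (np.log(p + eps) - np.log(m + eps))))
        kl_qm = float(np.sum(q * (np.log(q + eps) - np.log(m + eps))))
        return 0.5 * (kl_pm + kl_qm)

For instance, l1_distance(np.array([0.5, 0.5]), np.array([1.0, 0.0])) evaluates to 1.0.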
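
The Experiment Setup row reports Adam with a learning rate of 10^-1 for log Z_θ under the trajectory-balance loss and 10^-3 for the remaining parameters, polynomially annealed towards 0, with Leaky ReLU activations. The PyTorch sketch below illustrates such a configuration; the layer sizes, parameter names (log_z, policy_net), and total_steps are assumptions made for illustration.

    # Hedged sketch of the reported optimizer configuration; layer sizes,
    # variable names, and total_steps are assumed for illustration only.
    import torch

    log_z = torch.nn.Parameter(torch.zeros(()))  # log Z_theta (trajectory balance)
    policy_net = torch.nn.Sequential(            # placeholder forward-policy network
        torch.nn.Linear(64, 256),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(256, 64),
    )

    optimizer = torch.optim.Adam([
        {"params": [log_z], "lr": 1e-1},                  # 10^-1 for log Z_theta
        {"params": policy_net.parameters(), "lr": 1e-3},  # 10^-3 for the remaining parameters
    ])

    # Polynomial annealing of both learning rates towards 0 over training.
    total_steps = 10_000
    scheduler = torch.optim.lr_scheduler.PolynomialLR(
        optimizer, total_iters=total_steps, power=1.0
    )

Using per-parameter-group learning rates lets a single scheduler anneal both rates polynomially, matching the reported schedule in spirit.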