Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Symmetry-Aware GFlowNets

Authors: Hohyun Kim, Seunggeun Lee, Min-Hwan Oh

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that SA-GFN enables unbiased sampling while enhancing diversity and consistently generating high-reward graphs that closely match the target distribution. Through experiments, we validate our theoretical results and demonstrate the effectiveness of our method in generating diverse and high-reward samples.
Researcher Affiliation | Academia | Graduate School of Data Science, Seoul National University, Seoul, Republic of Korea. Correspondence to: Seunggeun Lee <EMAIL>, Min-Hwan Oh <EMAIL>.
Pseudocode | No | The paper describes algorithms and methods in prose and through mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. For example, Section 5, 'Symmetry-Aware GFlowNets', describes the method without a pseudocode block.
Open Source Code | Yes | Source code available at: https://github.com/hohyun312/sagfn
Open Datasets | Yes | This bias is particularly problematic for tasks such as molecular generation, as molecules inherently possess natural symmetries. For instance, in the ZINC250k dataset, over 50% of molecules exhibit more than one symmetry, with 18% containing four or more symmetries. ... Rewards are provided by a proxy model, which predicts the HOMO-LUMO gap. In the fragment-based task, we use a predefined set of fragments... For the Pearson correlation evaluation presented in Figure 4, terminal states were sampled by uniformly selecting random actions. The model likelihood was computed using Equation (4), with a modified correction term in Theorem 5.3. We set M = 5 and used 2,048 samples for the test set.
Dataset Splits | Yes | For this relatively small environment, we compute exact terminating probabilities for all states without approximations for evaluation. ... To evaluate the effectiveness of the proposed model likelihood estimator, we sampled 100 terminal states for each category (5, 9, and 12 edges), resulting in a total of 300 states, and estimated their model likelihood using Equation (4). ... We sampled 5,000 molecules from each method and evaluated them using common metrics.
Hardware Specification | Yes | Experiments were conducted on an Apple M1 processor. ... All timing experiments were conducted using a single processor with a TITAN RTX GPU (24GB) and an Intel Xeon Silver 4216 CPU. ... Experiments were performed using an Intel Xeon Silver 4216 CPU.
Software Dependencies | No | We used the Adam optimizer (Kingma, 2014) with the default parameters from PyTorch (Paszke et al., 2019) settings... In our experiments, we used the bliss algorithm (Junttila & Kaski, 2007), included in the igraph package (Csardi & Nepusz, 2006)... For large molecules, we can still count automorphisms in a few milliseconds using the nauty package (McKay & Piperno, 2013)... We used an open-source code for tasks. We used a graph transformer architecture (Yun et al., 2019).
Experiment Setup | Yes | Details on hyperparameters and model configurations can be found in Appendix K. ... For the illustrative experiment, homogeneous graphs were constructed edge by edge, allowing only Add Edge and Stop actions. ... We trained the models for 30,000 updates using the TB objective. During the first 16,000 steps, each update used a batch of 128 trajectories, comprising 32 samples from the current policy and 96 samples drawn from the replay buffer. We used the Adam optimizer (Kingma, 2014) with the default parameters from PyTorch (Paszke et al., 2019) settings, except for the learning rates: 0.0001 for GNN layers and 0.01 for the normalizing constant Z. For the remaining steps, we increased batch size to 256 and annealed the learning rate to 0.00001. ... Table 7: Hyperparameters for atom-based experiments, Table 8: Hyperparameters for fragment-based experiments.
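The Open Datasets evidence above mentions a Pearson correlation evaluation between estimated and exact model likelihoods (Figure 4 of the paper). As a reminder of what that metric computes, here is a minimal stdlib-only sketch of the Pearson correlation coefficient; the toy inputs are illustrative and are not data from the paper:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly correlated toy values; r should be (numerically) 1.
print(pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

In the paper's evaluation, `xs` would be estimated log-likelihoods and `ys` the exact terminating probabilities computed for the small environment.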
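The symmetry counts in the Open Datasets row and the bliss/nauty dependencies in the Software Dependencies row both refer to counting graph automorphisms. The paper uses bliss (via igraph) and nauty for this; since those libraries may not be installed, the brute-force sketch below (a hypothetical helper, exponential in node count, feasible only for tiny graphs) illustrates what is being counted:

```python
from itertools import permutations

def count_automorphisms(n, edges):
    """Count permutations of {0..n-1} that map the edge set onto itself.

    Naive brute force over all n! node permutations; real pipelines use
    bliss or nauty, which scale to large molecules in milliseconds.
    """
    edge_set = {frozenset(e) for e in edges}
    count = 0
    for perm in permutations(range(n)):
        if {frozenset((perm[u], perm[v])) for u, v in edge_set} == edge_set:
            count += 1
    return count

# A 4-cycle has 8 automorphisms (the dihedral group of the square).
print(count_automorphisms(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```

A molecule "with more than one symmetry" in the quoted ZINC250k statistic corresponds to a molecular graph whose automorphism count exceeds 1.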
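The training schedule quoted in the Experiment Setup row (30,000 TB updates; batch 128 with a 32/96 on-policy/replay split for the first 16,000 steps, then batch 256 with the learning rate annealed from 1e-4 to 1e-5) can be sketched as a step-indexed configuration function. This is a reconstruction of the quoted numbers only; the linear annealing ramp is an assumption, and the authors' actual code is in the linked repository:

```python
def schedule(step, total_steps=30_000, switch_step=16_000):
    """Batch size and GNN learning rate per TB update, per the quoted setup.

    First 16,000 steps: batch of 128 trajectories (32 on-policy + 96 from
    the replay buffer), GNN lr 1e-4 (the normalizing constant Z uses 1e-2).
    Remaining steps: batch 256, lr annealed to 1e-5 (linear ramp assumed).
    """
    if step < switch_step:
        return {"batch_size": 128, "lr_gnn": 1e-4}
    frac = (step - switch_step) / (total_steps - switch_step)
    return {"batch_size": 256, "lr_gnn": 1e-4 + frac * (1e-5 - 1e-4)}

print(schedule(0))       # early phase: batch 128, lr 1e-4
print(schedule(30_000))  # final step: batch 256, lr annealed to ~1e-5
```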