Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models

Authors: Daniel Levy, Siba Smarak Panigrahi, Sékou-Oumar Kaba, Qiang Zhu, Kin Long Kelvin Lee, Mikhail Galkin, Santiago Miret, Siamak Ravanbakhsh

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We empirically demonstrate our contributions, particularly in ensuring we generate crystals with desired symmetries while being competitive with existing baselines. We compare our proposed method with four recent strong baselines: CDVAE (Xie et al., 2022), DiffCSP (Jiao et al., 2023), DiffCSP++ (Jiao et al., 2024) and FlowMM (Miller et al., 2024). We retrained each method according to their given hyperparameters, and generated 10,000 crystals each.
Researcher Affiliation Collaboration Daniel Levy 1,2, Siba Smarak Panigrahi 1,2,3, Sékou-Oumar Kaba 1,2, Qiang Zhu 4, Kin Long Kelvin Lee 5, Mikhail Galkin 5, Santiago Miret 5, Siamak Ravanbakhsh 1,2 — 1 McGill University, 2 Mila, 3 École Polytechnique Fédérale de Lausanne (EPFL), 4 University of North Carolina at Charlotte, 5 Intel Labs
Pseudocode Yes
Algorithm 1: Training SymmCD
1: Input: dataset of crystals D
2: while not converged do
3:   Sample a crystal C = (L, X, A) from dataset D, and a timestep t ~ Uniform(1, T)
4:   Derive the asymmetric representation C′ = (G, k, X′, A′, S′) from C
5:   Add noise to k, X′, A′, and S′:
6:     k_t = √ᾱ_t k_0 + √(1 − ᾱ_t) ε_k,  ε_k ~ N(0, I)
7:     X′_t = √ᾱ_t X′_0 + √(1 − ᾱ_t) ε_X,  ε_X ~ WN(0, I)
8:     A′_t ~ Cat(A′ Q̄_{a,t})
9:     S′_{u,t} ~ Cat(S′ Q̄_{u,G,t})
10:  Use denoising network φ to predict ε̂_k, ε̂_X, Â′, Ŝ′ from noisy C′_t = (G, k_t, X′_t, A′_t, S′_t), t
11:  Compute losses L_k, L_X, L_A, L_S
12:  Update the denoising network φ using total loss:
13:    L = λ_k L_k + λ_X L_X + λ_A L_A + λ_S L_S
14: end while

Algorithm 2: Sampling from SymmCD
1: Input: target space group G, number of representatives M
2: Initialize:
3:   Sample k_T ~ N(0, I)
4:   Sample X′_T ~ U(0, 1)^{3×M}
5:   Sample A′_T ~ p_marginal(A′)
6:   Sample S′_T ~ p_marginal(S′ | G)  (site symmetries)
7: for t = T to 1 do
8:   Compute ε̂_k, ε̂_X, Â′, Ŝ′ using denoising network φ(·)
9:   Sample k_{t−1}, X′_{t−1}, A′_{t−1}, S′_{t−1} using ε̂_k, ε̂_X, Â′, Ŝ′
10: end for
11: Project S′_0 onto nearest valid point group
12: Project X′_0 onto nearest Wyckoff position with that site symmetry
13: Replicate representative atoms X′_0 using site symmetries S′_0 to generate full crystal X_0
14: Output: crystal structure X_0, atom types A_0, lattice L_0
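The Gaussian forward-noising step applied to the lattice parameters k (line 6 of Algorithm 1) can be sketched as follows. This is a generic DDPM-style sketch, not the authors' implementation: the cosine schedule, the constant s = 0.008, and the six-parameter lattice vector are illustrative assumptions.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    # Hypothetical cosine noise schedule (Nichol & Dhariwal style);
    # the schedule actually used by SymmCD may differ.
    s = 0.008
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def noise_lattice(k0, t, T, rng):
    # Forward diffusion step for lattice parameters k:
    #   k_t = sqrt(abar_t) * k_0 + sqrt(1 - abar_t) * eps,  eps ~ N(0, I)
    abar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(k0.shape)
    kt = np.sqrt(abar) * k0 + np.sqrt(1.0 - abar) * eps
    return kt, eps

rng = np.random.default_rng(0)
k0 = rng.standard_normal(6)  # six lattice parameters, purely illustrative
kt, eps = noise_lattice(k0, t=500, T=1000, rng=rng)
```

The denoising network is then trained to recover eps from (kt, t); the discrete components A′ and S′ use categorical transition matrices Q̄ instead of Gaussian noise.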
Open Source Code Yes Our code is publicly available at https://github.com/sibasmarak/SymmCD/
Open Datasets Yes We test our model on de novo crystal generation using the MP-20 dataset (Xie et al., 2022), a subset of the Materials Project (Jain et al., 2013) consisting of 40,476 crystals, each with up to 20 atoms per primitive unit cell. In addition to MP-20, we also trained SymmCD on the MPTS-52 dataset (Baird et al., 2024), a more challenging subset of the Materials Project that contains materials with up to 52 atoms per primitive unit cell.
Dataset Splits Yes We withhold 20% of the dataset as a validation set, and 20% as a test set. The dataset contains 40,476 samples with a train/validation/test split of 27,380/5,000/8,096 crystals.
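A generic random split along these lines can be sketched as below. Note this is only an illustration: the paper reports fixed split sizes of 27,380 / 5,000 / 8,096, so the authors presumably use predefined split files rather than a fresh random partition, and the fractions and seed here are assumptions.

```python
import numpy as np

def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    # Hypothetical train/validation/test partition of n samples.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(n * val_frac)
    n_test = int(n * test_frac)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(40476)
```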
Hardware Specification Yes We compare the two representations for one epoch of training using 40GB of RAM and a single NVIDIA MIG A100 and report the results in Table 4.
Software Dependencies Yes We carried out electronic structure calculations as a more accurate way to evaluate the crystal structures generated by the different methods. Since these computations are significantly more expensive than the ones with CHGNet, we performed them on 100 structures sampled from each model. Concretely, these involved using the CP2K (Kühne et al., 2020) suite of programs to perform cell and geometry relaxations to assess the quality of generated structures based on their distances from true local minima.
Table 9: Configuration settings for CP2K. Settings that are omitted from this table assume their default values.
Base SCF settings: EPS_SCF 10⁻⁷; MAX_SCF 300; MAX_ITER_LUMO 400; IGNORE_CONVERGENCE_FAILURE T
Orbital transformation: method IRAC; ENERGY_GAP 10⁻³; MINIMIZER DIIS; LINESEARCH 2PNT; PRECONDITIONER FULL_ALL
Outer SCF settings: MAX_SCF 20; EPS_SCF 10⁻⁶
Cell optimization: TYPE DIRECT_CELL_OPT; MAX_ITER 100; OPTIMIZER BFGS
Geometry optimization: MAX_DR 3×10⁻³; MAX_FORCE 9×10⁻⁴; RMS_DR 1.5×10⁻³; RMS_FORCE 6×10⁻⁴; MAX_ITER 100; OPTIMIZER BFGS; BFGS TRUST_RADIUS 0.25
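The settings in Table 9 map onto CP2K input sections roughly as in the fragment below. This is a hedged sketch of where each keyword lives in a CP2K input file, not the authors' actual input deck; surrounding sections (basis sets, functional, system definition) are omitted, and section placement should be checked against the CP2K reference manual.

```
&FORCE_EVAL
  &DFT
    &SCF
      EPS_SCF 1.0E-7
      MAX_SCF 300
      MAX_ITER_LUMO 400
      IGNORE_CONVERGENCE_FAILURE T
      &OT
        ALGORITHM IRAC
        ENERGY_GAP 1.0E-3
        MINIMIZER DIIS
        LINESEARCH 2PNT
        PRECONDITIONER FULL_ALL
      &END OT
      &OUTER_SCF
        MAX_SCF 20
        EPS_SCF 1.0E-6
      &END OUTER_SCF
    &END SCF
  &END DFT
&END FORCE_EVAL
&MOTION
  &CELL_OPT
    TYPE DIRECT_CELL_OPT
    MAX_ITER 100
    OPTIMIZER BFGS
  &END CELL_OPT
  &GEO_OPT
    MAX_DR 3.0E-3
    MAX_FORCE 9.0E-4
    RMS_DR 1.5E-3
    RMS_FORCE 6.0E-4
    MAX_ITER 100
    OPTIMIZER BFGS
    &BFGS
      TRUST_RADIUS 0.25
    &END BFGS
  &END GEO_OPT
&END MOTION
```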
Experiment Setup Yes The graph neural network has 8 layers, and we use a representation dimension of 1024 for h_i. We encode distances between nodes using a sinusoidal embedding, with 128 different frequencies. We encode the timestep t into a 10-dimensional vector. We apply layer normalization at each layer of the GNN. The loss coefficients selected were λ_k = 5, λ_X = 1, λ_A = 0.1 and λ_S = 10. We trained SymmCD on MPTS-52 using all of the same hyperparameters as were used for the MP-20 dataset, but trained for 1500 epochs.
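A sinusoidal distance embedding with 128 frequencies can be sketched as follows. The geometric frequency spacing and the 1–10 range are assumptions for illustration; only the count of 128 frequencies (giving a 256-dimensional sin/cos feature) comes from the paper.

```python
import numpy as np

def sinusoidal_embedding(d, n_freq=128, f_min=1.0, f_max=10.0):
    # Embed scalar interatomic distances d into 2 * n_freq features.
    # Frequencies are geometrically spaced between f_min and f_max
    # (an assumed choice; the paper does not state the spacing).
    freqs = np.exp(np.linspace(np.log(f_min), np.log(f_max), n_freq))
    ang = np.asarray(d, dtype=float)[..., None] * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)

emb = sinusoidal_embedding(np.array([0.5, 1.2, 3.0]))  # shape (3, 256)
```

Such embeddings give the GNN a smooth, multi-scale encoding of pairwise distances before message passing.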