Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Crystal Diffusion Variational Autoencoder for Periodic Material Generation
Authors: Tian Xie, Xiang Fu, Octavian-Eugen Ganea, Regina Barzilay, Tommi S. Jaakkola
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We significantly outperform past methods in three tasks: 1) reconstructing the input structure, 2) generating valid, diverse, and realistic materials, and 3) generating materials that optimize a specific property. We also provide several standard datasets and evaluation metrics for the broader machine learning community. |
| Researcher Affiliation | Academia | Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA |
| Pseudocode | Yes | Algorithm 1 Material Generation via Annealed Langevin Dynamics |
| Open Source Code | Yes | Code and data are available at https://github.com/txie-93/cdvae |
| Open Datasets | Yes | We curated 3 datasets representing different types of material distributions. 1) Perov-5 (Castelli et al., 2012a;b)... 2) Carbon-24 (Pickard, 2020)... 3) MP-20 (Jain et al., 2013) |
| Dataset Splits | Yes | We use a 60-20-20 random split for all of our experiments. |
| Hardware Specification | Yes | Time used for generating 10,000 materials on a single RTX 2080 Ti GPU. |
| Software Dependencies | No | The paper mentions several software components, including 'pymatgen', 'CrystalNN', 'DimeNet++', 'GemNet-dQ', and the 'Open Catalyst Project (OCP)'. However, it does not give version numbers for these dependencies, which are required for full reproducibility. |
| Experiment Setup | Yes | The total loss can be written as $\mathcal{L} = \mathcal{L}_{\text{AGG}} + \mathcal{L}_{\text{DEC}} + \mathcal{L}_{\text{KL}} = \lambda_c \mathcal{L}_c + \lambda_L \mathcal{L}_L + \lambda_N \mathcal{L}_N + \lambda_X \mathcal{L}_X + \lambda_A \mathcal{L}_A + \beta \mathcal{L}_{\text{KL}}$. We aim to keep each loss term at a similar scale. For all three datasets, we use $\lambda_c = 1$, $\lambda_L = 10$, $\lambda_N = 1$, $\lambda_X = 10$, $\lambda_A = 1$. We tune $\beta$ between 0.01, 0.03, 0.1 for all three datasets and select the model with best validation loss. For Perov-5, MP-20, we use $\beta = 0.01$, and for Carbon-24, we use $\beta = 0.03$. For the noise levels in $\{\sigma_{A,j}\}_{j=1}^{L}$, $\{\sigma_{X,j}\}_{j=1}^{L}$, we follow Shi et al. (2021) and set $L = 50$. For all three datasets, we use $\sigma_{A,\max} = 5$, $\sigma_{A,\min} = 0.01$, $\sigma_{X,\max} = 10$, $\sigma_{X,\min} = 0.01$. During the training, we use an initial learning rate of 0.001 and reduce the learning rate by a factor of 0.6 if the validation loss does not improve after 30 epochs. The minimum learning rate is 0.0001. During the generation, we use $\epsilon = 0.0001$ and run Langevin dynamics for 100 steps at each noise level. |
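
The settings quoted in the Dataset Splits and Experiment Setup rows can be gathered into a small configuration sketch. This is a minimal illustration only, not code from the cdvae repository: all function and variable names are hypothetical, the geometric spacing of the noise levels is an assumption (the quoted text gives only the maximum, the minimum, and L = 50), the learning-rate rule is one plausible reading of "does not improve after 30 epochs", and the coefficient printed as "LA = 1" is read as λ_A = 1.

```python
import random

# Loss weights quoted in the Experiment Setup row.
# Assumption: the coefficient printed as "LA = 1" is lambda_A = 1.
LOSS_WEIGHTS = {"c": 1.0, "L": 10.0, "N": 1.0, "X": 10.0, "A": 1.0}

# KL weight selected per dataset, as quoted.
BETA = {"Perov-5": 0.01, "MP-20": 0.01, "Carbon-24": 0.03}


def total_loss(terms, kl, beta, weights=LOSS_WEIGHTS):
    """Weighted sum of the five loss terms plus beta times the KL term."""
    return sum(weights[k] * terms[k] for k in weights) + beta * kl


def noise_levels(sigma_max, sigma_min, num_levels=50):
    """Decreasing noise levels from sigma_max to sigma_min.

    Assumption: geometric spacing; the quoted text does not state the spacing.
    """
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio**j for j in range(num_levels)]


def random_split(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test (60-20-20)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def next_lr(lr, epochs_without_improvement,
            factor=0.6, patience=30, min_lr=1e-4):
    """Scale the learning rate by `factor` once validation loss has stalled
    for `patience` epochs, never going below `min_lr`.

    Assumption: a simplified reduce-on-plateau rule, not the authors' scheduler.
    """
    if epochs_without_improvement >= patience:
        return max(lr * factor, min_lr)
    return lr


# Noise schedules with the quoted bounds (sigma_A: 5 to 0.01, sigma_X: 10 to 0.01).
SIGMA_A = noise_levels(5.0, 0.01)
SIGMA_X = noise_levels(10.0, 0.01)
```

For example, `total_loss({"c": 1, "L": 1, "N": 1, "X": 1, "A": 1}, kl=1, beta=BETA["MP-20"])` weights the lattice and coordinate terms ten times more heavily than the others, which matches the stated aim of keeping each loss term at a similar scale.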