Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation

Authors: Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arroyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments on four datasets following DiffCSP [17] and CrystaLLM [12], including Perov-5, Carbon-24, MP-20, and MPTS-52." (Section 4, Experimental Results)
Researcher Affiliation | Academia | (1) Texas A&M University, College Station, TX 77843, USA; (2) University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; (3) Harvard University, Boston, MA 02115, USA
Pseudocode | No | The paper describes its method via text and a pipeline diagram (Figure 2) but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | "The code will be released after the paper is publicly available."
Open Datasets | Yes | "We conduct experiments on four datasets following DiffCSP [17] and CrystaLLM [12], including Perov-5, Carbon-24, MP-20, and MPTS-52. ... We have used datasets including Perov-5, Carbon-24, and MP-20 curated by CDVAE [15] with MIT License, MPTS-52 curated by DiffCSP [17] with MIT License, JARVIS-DFT [42] with NIST License, CrystaLLM [12] with MIT License, the Materials Project [26] with Creative Commons Attribution 4.0 License, OQMD [30] with Creative Commons Attribution 4.0 International License, and NOMAD [31] with Apache License Version 2.0, January 2004."
Dataset Splits | Yes | "We directly follow DiffCSP [17] to split corresponding datasets into training, evaluation, and test sets." (A data-loading sketch matching this layout appears after the table.)
Hardware Specification | Yes | "A single NVIDIA A100 GPU is used for computing for this task."
Software Dependencies | No | The paper mentions using GPT-2 as the language model but does not specify software dependencies like programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "We show the detailed training parameters, including window size, batch size, learning rate, dropout ratio, and number of training iterations for different tasks, in Table 7. ... During the sampling phase, for the Perov-5 dataset, we use temperature=0.7 and top-k=10 for one-shot generation..."
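
The Perov-5 sampling settings quoted in the last row map directly onto standard autoregressive decoding. Below is a minimal sketch (not the authors' released code) assuming a HuggingFace Transformers GPT-2 checkpoint fine-tuned on tokenized crystal sequences; the checkpoint name "gpt2", the prompt string, and the max_new_tokens budget are illustrative placeholders.

```python
# Sketch of one-shot sampling with a GPT-2 language model using the reported
# Perov-5 settings (temperature=0.7, top-k=10). Checkpoint, prompt, and
# max_new_tokens are placeholders, not the authors' configuration.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # placeholder checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "<crystal>"  # hypothetical start token for a tokenized crystal sequence
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,          # stochastic decoding
    temperature=0.7,         # reported Perov-5 temperature
    top_k=10,                # reported Perov-5 top-k
    num_return_sequences=1,  # one-shot generation: a single sample per prompt
    max_new_tokens=256,      # placeholder sequence-length budget
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```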
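
For the Dataset Splits row: the CDVAE/DiffCSP benchmarks distribute each dataset with pre-split files, so "following DiffCSP" amounts to reading those files as-is. A minimal sketch follows, assuming the CDVAE-style layout in which each dataset directory ships train.csv, val.csv, and test.csv; the directory path is a placeholder.

```python
# Sketch of loading the pre-split CDVAE/DiffCSP benchmark files for one
# dataset (e.g., Perov-5). The path below is a placeholder.
import pandas as pd

DATA_DIR = "data/perov_5"  # placeholder path to the downloaded dataset

splits = {
    name: pd.read_csv(f"{DATA_DIR}/{name}.csv")
    for name in ("train", "val", "test")
}
for name, frame in splits.items():
    print(f"{name}: {len(frame)} structures")
```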