Invariant Tokenization of Crystalline Materials for Language Model Enabled Generation
Authors: Keqiang Yan, Xiner Li, Hongyi Ling, Kenna Ashen, Carl Edwards, Raymundo Arroyave, Marinka Zitnik, Heng Ji, Xiaofeng Qian, Xiaoning Qian, Shuiwang Ji
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on four datasets following DiffCSP [17] and CrystaLLM [12], including Perov-5, Carbon-24, MP-20, and MPTS-52. |
| Researcher Affiliation | Academia | 1 Texas A&M University, College Station, TX 77843, USA; 2 University of Illinois Urbana-Champaign, Champaign, IL 61820, USA; 3 Harvard University, Boston, MA 02115, USA |
| Pseudocode | No | The paper describes its method via text and a pipeline diagram (Figure 2) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The code will be released after the paper is publicly available. |
| Open Datasets | Yes | We conduct experiments on four datasets following DiffCSP [17] and CrystaLLM [12], including Perov-5, Carbon-24, MP-20, and MPTS-52. ... We have used datasets including Perov-5, Carbon-24, and MP-20 curated by CDVAE [15] with MIT License, MPTS-52 curated by DiffCSP [17] with MIT License, JARVIS-DFT [42] with NIST License, CrystaLLM [12] with MIT License, the Materials Project [26] with Creative Commons Attribution 4.0 License, OQMD [30] with Creative Commons Attribution 4.0 International License, and NOMAD [31] with Apache License, Version 2.0. |
| Dataset Splits | Yes | We directly follow DiffCSP [17] to split corresponding datasets into training, evaluation, and test sets. (See the loading sketch after this table.) |
| Hardware Specification | Yes | A single NVIDIA A100 GPU is used for computing for this task. |
| Software Dependencies | No | The paper mentions using GPT-2 as the language model but does not specify software dependencies like programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, or TensorFlow versions). |
| Experiment Setup | Yes | We show the detailed training parameters, including window size, batch size, learning rate, dropout ratio, and number of training iterations for different tasks, in Table 7. ... During the sampling phase, for the Perov-5 dataset, we use temperature=0.7 and top-k=10 with one-shot generation... (See the sampling sketch after this table.) |
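The splits themselves are inherited from DiffCSP, whose datasets follow the CDVAE release layout (per-dataset folders containing train/val/test CSVs with a `cif` column). Below is a minimal loading sketch assuming that layout; the local path is a placeholder, not a path from the paper.

```python
# Minimal sketch of loading DiffCSP/CDVAE-style dataset splits.
# Assumes the public CDVAE data layout: one folder per dataset with
# train.csv, val.csv, and test.csv. DATA_DIR is a placeholder.
import pandas as pd

DATA_DIR = "data/mp_20"  # placeholder; per-dataset folder from the CDVAE release

splits = {name: pd.read_csv(f"{DATA_DIR}/{name}.csv") for name in ("train", "val", "test")}
for name, df in splits.items():
    print(f"{name}: {len(df)} structures")
```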
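The paper's code was not released at review time, so the reported Perov-5 decoding configuration (temperature=0.7, top-k=10, one-shot generation) can only be sketched. The following is a minimal, hypothetical reconstruction using Hugging Face `transformers` with a GPT-2 backbone; the checkpoint path, prompt format, and token budget are assumptions, not the authors' artifacts.

```python
# Hedged sketch of the reported Perov-5 sampling setup (temperature=0.7,
# top-k=10, one-shot generation) with a GPT-2 backbone. The checkpoint
# path and prompt below are hypothetical: the paper's fine-tuned weights
# and invariant tokenization code were not available at review time.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

MODEL_PATH = "gpt2"  # placeholder; would be the authors' fine-tuned checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(MODEL_PATH)
model = GPT2LMHeadModel.from_pretrained(MODEL_PATH)
model.eval()

# Hypothetical prompt: a composition prefix in the paper's invariant
# crystal-sequence format would go here.
prompt = "<composition> Sr Ti O3 <struct>"
inputs = tokenizer(prompt, return_tensors="pt")

# One-shot sampling with the reported decoding parameters.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,
    top_k=10,
    max_new_tokens=256,  # assumed budget; not specified in the quoted setup
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Top-k=10 restricts each decoding step to the ten most probable tokens, trading diversity for validity, which matters for rigidly structured outputs such as crystal sequences.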