Machine Learning Force Fields with Data Cost Aware Training

Authors: Alexander Bukharin, Tianyi Liu, Shengjie Wang, Simiao Zuo, Weihao Gao, Wen Yan, Tuo Zhao

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on MD datasets and downstream tasks validate the efficacy of ASTEROID. Our code and data are available at https://github.com/abukharin3/asteroid.
Researcher Affiliation | Collaboration | (1) Department of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA; (2) ByteDance Inc., Seattle, USA. Work was done during Alexander Bukharin's internship at ByteDance.
Pseudocode | No | The paper includes a workflow diagram (Figure 2) but does not present any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Our code and data are available at https://github.com/abukharin3/asteroid.
Open Datasets | Yes | For the CCSD(T) data, we use MD17@CCSD, which contains 1,000 configurations labeled at the CCSD(T) and CCSD level of accuracy for five molecules (Chmiela et al., 2017). For DFT data, we use the MD17 dataset, which contains molecular configurations labeled at the DFT level of accuracy (Chmiela et al., 2017). For the empirical force field data, we generate 100,000 configurations for each molecule using the OpenMM empirical force field software (Eastman et al., 2017). For the unlabeled datasets, we use MD17 with the force labels removed. (A hedged OpenMM data-generation sketch follows the table.)
Dataset Splits | Yes | A validation set of size 50 and a test set of size 500 are used in all experiments. (A split sketch follows the table.)
Hardware Specification | No | The paper mentions 'on modern hardware' but does not provide specific details about the CPU, GPU, memory, or any other hardware components used for running the experiments.
Software Dependencies | No | The paper mentions the use of GemNet, EGNN, the Atomic Simulation Environment (ASE), and the OpenMM software but does not specify their version numbers or other crucial software dependencies required for reproducibility.
Experiment Setup | Yes | When training with inaccurate data, we train with a batch size of 16 and stop training when the validation loss stabilizes. In the fine-tuning stage, we use a batch size of 10 and train for a maximum of 2000 epochs. To tune the bias-aware loss parameter γ, we search over the set {0.1, 0.5, 1.0, 2.0} and select the model with the lowest validation loss. Comprehensive experimental details are deferred to Appendix A.2, which provides hyperparameters for GemNet and EGNN training, including: evaluation interval: 1 epoch; decay steps: 1,200,000; warmup steps: 10,000; decay patience: 50,000; decay cooldown: 50,000; learning rate: 1e-4 for training on inaccurate data and 1e-5 for fine-tuning; number of layers: 5; embedding size: 128. (A hedged two-stage training sketch follows the table.)
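
The Open Datasets row states that the empirical force field configurations were generated with OpenMM. The sketch below is a rough, hedged illustration of that kind of data generation, not the authors' script: the input PDB file, the amber14 force field, the 500 K temperature, and the 10-step sampling interval are all assumptions, and a small molecule would typically need its own residue templates in practice.

```python
# Hedged sketch: sample configurations and cheap force labels with OpenMM.
# The input PDB, the amber14 force field, 500 K, and the 10-step sampling
# interval are assumptions, not the paper's exact settings.
import numpy as np
from openmm import LangevinMiddleIntegrator, unit
from openmm.app import ForceField, NoCutoff, PDBFile, Simulation

pdb = PDBFile("aspirin.pdb")                 # hypothetical input structure
forcefield = ForceField("amber14-all.xml")   # assumed empirical force field
system = forcefield.createSystem(pdb.topology, nonbondedMethod=NoCutoff)
integrator = LangevinMiddleIntegrator(500 * unit.kelvin,
                                      1.0 / unit.picosecond,
                                      0.5 * unit.femtosecond)
sim = Simulation(pdb.topology, system, integrator)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()

positions, forces = [], []
for _ in range(100_000):                     # 100,000 configurations per molecule
    sim.step(10)                             # assumed decorrelation interval
    state = sim.context.getState(getPositions=True, getForces=True)
    positions.append(state.getPositions(asNumpy=True).value_in_unit(unit.angstrom))
    forces.append(state.getForces(asNumpy=True).value_in_unit(
        unit.kilocalorie_per_mole / unit.angstrom))

np.savez("aspirin_empirical.npz", R=np.array(positions), F=np.array(forces))
```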
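For the Dataset Splits row, here is a minimal sketch of carving out a fixed 50-configuration validation set and 500-configuration test set from a pool of labeled configurations; the random seed, the permutation-based split, and the pool size are assumptions.

```python
# Hedged sketch: fixed-size validation/test split (50 validation, 500 test).
# The seed, the permutation-based split, and the pool size are assumptions.
import numpy as np

def split_indices(n_total, n_val=50, n_test=500, seed=0):
    """Return train, validation, and test index arrays over n_total configurations."""
    perm = np.random.default_rng(seed).permutation(n_total)
    return perm[n_val + n_test:], perm[:n_val], perm[n_val:n_val + n_test]

train_idx, val_idx, test_idx = split_indices(n_total=100_000)
```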
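The Experiment Setup row describes a two-stage schedule: train on inaccurate data (batch size 16, learning rate 1e-4) until the validation loss stabilizes, then fine-tune on accurate data (batch size 10, learning rate 1e-5, up to 2000 epochs), selecting the bias-aware loss parameter γ from {0.1, 0.5, 1.0, 2.0} by validation loss. The sketch below mirrors only that schedule and grid search; the toy MLP, the exponential bias weighting, the synthetic tensors, and the reduced epoch counts are assumptions and do not reproduce the GemNet/EGNN models or the paper's exact bias-aware loss.

```python
# Hedged sketch of the two-stage schedule and gamma grid search described above.
# The toy MLP, the exp(-gamma * bias) weighting, and the random tensors are
# illustrative assumptions; the paper trains GemNet/EGNN with its own bias-aware loss.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

N_ATOMS = 21  # aspirin-sized molecule (assumption)

def make_model():
    # Maps flattened coordinates (B, 3*N_ATOMS) to flattened forces (B, 3*N_ATOMS).
    return nn.Sequential(nn.Linear(3 * N_ATOMS, 128), nn.SiLU(),
                         nn.Linear(128, 3 * N_ATOMS))

def run_stage(model, dataset, batch_size, lr, epochs, gamma=None):
    # One training stage; if gamma is given, each configuration's force loss is
    # down-weighted by an assumed factor exp(-gamma * estimated_bias).
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for batch in loader:
            R, F, bias = batch if len(batch) == 3 else (batch[0], batch[1], None)
            pred = model(R.flatten(1)).view_as(F)
            per_sample = ((pred - F) ** 2).mean(dim=(1, 2))
            if gamma is not None and bias is not None:
                per_sample = per_sample * torch.exp(-gamma * bias)
            loss = per_sample.mean()
            opt.zero_grad(); loss.backward(); opt.step()
    return model

# Random stand-ins for the inaccurate, accurate, and validation splits.
R_cheap, F_cheap = torch.randn(2000, N_ATOMS, 3), torch.randn(2000, N_ATOMS, 3)
bias_est = torch.rand(2000)          # assumed per-configuration bias estimate
R_acc, F_acc = torch.randn(450, N_ATOMS, 3), torch.randn(450, N_ATOMS, 3)
R_val, F_val = torch.randn(50, N_ATOMS, 3), torch.randn(50, N_ATOMS, 3)

best_model, best_val = None, float("inf")
for gamma in [0.1, 0.5, 1.0, 2.0]:   # grid from the paper
    model = make_model()
    model = run_stage(model, TensorDataset(R_cheap, F_cheap, bias_est),
                      batch_size=16, lr=1e-4, epochs=5)    # paper: until val loss stabilizes
    model = run_stage(model, TensorDataset(R_acc, F_acc),
                      batch_size=10, lr=1e-5, epochs=50)   # paper: up to 2000 epochs
    with torch.no_grad():
        val = ((model(R_val.flatten(1)).view_as(F_val) - F_val) ** 2).mean().item()
    if val < best_val:
        best_model, best_val = model, val
```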