Unifying Molecular and Textual Representations via Multi-task Language Modelling

Authors: Dimitrios Christofidellis, Giorgio Giannone, Jannis Born, Ole Winther, Teodoro Laino, Matteo Manica

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "4. Experiments. Evaluation. Evaluating the model is challenging as it spans multiple domains. For this reason, we treat each task separately and we rely on a combination of NLP-based as well as task-specific metrics." See also "Table 2: Results across domains and tasks." (A metrics sketch follows below, after this table.)
Researcher Affiliation | Collaboration | 1 IBM Research Europe, 2 Technical University of Denmark, 3 Massachusetts Institute of Technology, 4 ETH Zurich, 5 University of Copenhagen.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | "Code Availability. The Multitask Text and Chemistry T5 model is available for inference, training and finetuning via the GT4SD library (Manica et al., 2023): https://github.com/GT4SD/gt4sd-core. A gradio (Abid et al., 2019) app, built on top of the GT4SD implementation and hosted publicly on Hugging Face Spaces, allows easy access to the models: https://huggingface.co/spaces/GT4SD/multitask-text-and-chemistry-t5." Code is available at: https://github.com/GT4SD/multitask_text_and_chemistry_t5. (An inference sketch follows below, after this table.)
Open Datasets | Yes | "Dataset. To train our model, we generated a multi-domain and a multi-task dataset by aggregating available datasets for each task of interest. Specifically, we leveraged the dataset used in Toniato et al. (2021), which has been derived from the Pistachio dataset (Nextmove, 2023) (release of 18 November 2019), for the mol2mol tasks. [...] Finally, we use the ChEBI-20 dataset (Edwards et al., 2021; 2022) (26k molecule-description pairs as training set, 3k pairs as validation set and 3k as testing set) for the description-to-smiles and smiles-to-caption tasks."
Dataset Splits | Yes | "This dataset contains 2.3M reactant-product pairs as training set, 10k pairs as validation set and 10k pairs as testing set. For the paragraph-to-actions task, we relied on the procedures dataset (2.16M samples in the training set and 270k samples in the validation set and in the testing set) presented in Vaucher et al. Finally, we use the ChEBI-20 dataset (Edwards et al., 2021; 2022) (26k molecule-description pairs as training set, 3k pairs as validation set and 3k as testing set) for the description-to-smiles and smiles-to-caption tasks." (A data-loading sketch follows below, after this table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running the experiments.
Software Dependencies | No | "The training process is carried out using the language modeling trainer based on Hugging Face transformers (Wolf et al., 2020) and PyTorch Lightning (Falcon and The PyTorch Lightning team, 2019) from the GT4SD library (Manica et al., 2023)."
Experiment Setup | Yes | "Table 12: Relevant Hyperparameters for Text+Chem T5." The table lists: Heads, Layers, Epochs, Batch size, Accumulated gradient batches, Learning rate, Input max length, Parameters. (A configuration sketch follows below, after this table.)
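
The evaluation combines NLP metrics with task-specific chemistry metrics. As a minimal sketch of what such a combination can look like, the snippet below scores a generated caption with sentence-level BLEU and a generated SMILES string with exact match and RDKit parseability; these particular metric choices are illustrative assumptions, not the exact metric set reported in Table 2 of the paper.

```python
# Illustrative evaluation sketch: an NLP metric for generated text plus
# chemistry-specific checks for generated SMILES. The exact per-task metrics
# in the paper differ; these are stand-ins.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rdkit import Chem


def caption_bleu(reference: str, prediction: str) -> float:
    """Sentence-level BLEU between a reference caption and a prediction."""
    smoothing = SmoothingFunction().method1
    return sentence_bleu([reference.split()], prediction.split(),
                         smoothing_function=smoothing)


def smiles_scores(reference: str, prediction: str) -> dict:
    """Task-specific checks: exact string match and RDKit parseability."""
    return {
        "exact_match": float(reference == prediction),
        "valid": float(Chem.MolFromSmiles(prediction) is not None),
    }


print(caption_bleu("the molecule is an aromatic ether", "the molecule is an ether"))
print(smiles_scores("CCO", "CCO"))
```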
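Since the model is distributed for inference via GT4SD and a public Hugging Face space, a plain Hugging Face transformers call is one way to try it. The checkpoint identifier and the task-prefix prompt below are assumptions in the style of the public GT4SD checkpoints, not details quoted from the paper; check the model cards linked above for the exact names.

```python
# Minimal inference sketch using Hugging Face transformers.
# The checkpoint id and the prompt prefix are assumptions; see the GT4SD
# model cards / Hugging Face space for the exact ones.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "GT4SD/multitask-text-and-chemistry-t5-base-augm"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Assumed task-prefix style prompt for the smiles-to-caption task.
prompt = "Caption the following smiles: CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128, num_beams=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```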
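For the ChEBI-20 splits (26k training, 3k validation, 3k test molecule-description pairs), the sketch below reads one split file into (SMILES, description) pairs. The file names and the tab-separated CID/SMILES/description layout are assumptions about the publicly distributed split files, not details given in the excerpt above.

```python
# Sketch for reading ChEBI-20 split files into (SMILES, description) pairs.
# File names and the tab-separated CID / SMILES / description layout are assumed.
import csv
from pathlib import Path


def load_chebi20_split(path) -> list[tuple[str, str]]:
    """Return (smiles, description) pairs from one ChEBI-20 split file."""
    pairs = []
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            pairs.append((row["SMILES"], row["description"]))  # assumed column names
    return pairs


data_dir = Path("ChEBI-20_data")  # hypothetical local directory
splits = {name: load_chebi20_split(data_dir / f"{name}.txt")
          for name in ("train", "validation", "test")}
print({name: len(pairs) for name, pairs in splits.items()})  # expect roughly 26k / 3k / 3k
```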
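Table 12 names the hyperparameters, but the excerpt above does not give their values. The sketch below simply collects those names into a configuration object; every value is a placeholder (roughly T5-base-like defaults), not a setting reported in the paper. Such a configuration would then be passed to the GT4SD language-modeling trainer mentioned under Software Dependencies.

```python
# Hypothetical training configuration mirroring the hyperparameter names in
# Table 12. All values are placeholders, not the settings reported in the paper.
from dataclasses import dataclass


@dataclass
class TextChemT5Config:
    heads: int = 12                      # attention heads (placeholder)
    layers: int = 12                     # encoder/decoder layers (placeholder)
    epochs: int = 1                      # placeholder
    batch_size: int = 8                  # placeholder
    accumulated_grad_batches: int = 1    # placeholder
    learning_rate: float = 1e-4          # placeholder
    input_max_length: int = 512          # placeholder
    parameters: str = "~220M"            # total parameter count (placeholder)


config = TextChemT5Config()
print(config)
```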