Language Models as Science Tutors

Authors: Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Junjie Zhu, Zhiyong Ren, Sanjeev Arora, Danqi Chen

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | TUTOREVAL helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multidisciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TUTOREVAL. Therefore, we create TUTORCHAT, a dataset of 80,000 long synthetic dialogues about textbooks. We use TUTORCHAT to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TUTOREVAL while performing strongly on GSM8K and MATH.
Researcher Affiliation | Collaboration | (1) Princeton Language and Intelligence, Princeton University; (2) School of Natural Sciences, Institute for Advanced Study; (3) School of Mathematics, Institute for Advanced Study; (4) Neuroscience Institute, Princeton University; (5) Hebrew University of Jerusalem; (6) Department of Computer Science, Oxford University; (7) FIM Institute for Mathematical Research, ETH Zürich; (8) Department of Computer Science, University of Wisconsin-Madison; (9) Meta FAIR; (10) Department of Civil and Environmental Engineering, Princeton University; (11) Andlinger Center for Energy and the Environment, Princeton University.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release competitive long-context models specialized in science and math reasoning, as well as all our data and evaluations, at https://github.com/princeton-nlp/LM-Science-Tutor. (A hedged loading sketch follows the table.)
Open Datasets | Yes | Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.
Dataset Splits | Yes | We create a validation split of 2.5K samples from TUTORCHAT. (A hedged split sketch follows the table.)
Hardware Specification | Yes | We use 16 H100 GPUs to fine-tune Llemma-7B-32K on this dataset. To fine-tune Llemma-7B-32K, we use one A100 GPU with 80GB memory. To fine-tune Llemma-34B, we use 32 H100 GPUs.
Software Dependencies | No | The paper mentions software such as Flash Attention but does not provide specific version numbers for the software dependencies used in its experiments.
Experiment Setup | Yes | We use a batch size of 512, a learning rate of 2e-5 with a 10% warm-up, and the Adam optimizer (Kingma & Ba, 2015). We always fine-tune for two epochs, with a batch size of 16, a learning rate of 1e-5, and a 10% warm-up. (A hedged training-configuration sketch follows the table.)
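
Since the "Open Source Code" row points to publicly released long-context tutor models, here is a minimal sketch of how one might load and query such a model with Hugging Face Transformers. The model identifier is a hypothetical placeholder, not confirmed by the paper; the authoritative names, along with the Flash Attention setup (whose version the paper does not specify), are documented in the GitHub repository.

```python
# Minimal sketch, assuming the released checkpoints are hosted on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "princeton-nlp/Llemma-7B-32K-MathMix"  # hypothetical identifier; see the repo README for real names

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # assumption; precision is not stated in the rows above
    device_map="auto",                         # spread layers across available GPUs
    attn_implementation="flash_attention_2",   # the paper mentions Flash Attention, version unspecified
)

prompt = "Explain why the harmonic series diverges."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```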
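The "Dataset Splits" row reports a 2.5K-sample validation split carved out of TUTORCHAT. The sketch below shows one way to reproduce a split of that size with the `datasets` library; the dataset identifier, the random-split method, and the seed are assumptions for illustration, not the authors' documented procedure.

```python
# Minimal sketch, assuming TUTORCHAT is available as a Hub dataset with a single train split.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/TutorChat", split="train")  # hypothetical identifier

# Hold out 2,500 examples for validation (size taken from the row above; seed is arbitrary).
splits = dataset.train_test_split(test_size=2500, seed=42)
train_set, validation_set = splits["train"], splits["test"]
print(len(train_set), len(validation_set))
```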
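The "Experiment Setup" row quotes two hyperparameter sets. The sketch below expresses the fine-tuning set (two epochs, batch size 16, learning rate 1e-5, 10% warm-up) as Hugging Face `TrainingArguments`; treating the other set (batch size 512, learning rate 2e-5) as belonging to a separate training stage is an assumption, as are the output path, precision, and the split of the batch into accumulation steps.

```python
# Minimal sketch of the quoted fine-tuning hyperparameters, under the assumptions stated above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llemma-7b-32k-tutorchat",  # hypothetical path
    num_train_epochs=2,                    # "we always fine-tune for two epochs"
    per_device_train_batch_size=1,         # assumption
    gradient_accumulation_steps=16,        # assumption: reach an effective batch size of 16
    learning_rate=1e-5,                    # fine-tuning learning rate from the quote
    warmup_ratio=0.1,                      # 10% warm-up
    optim="adamw_torch",                   # Adam-family optimizer (Kingma & Ba, 2015)
    bf16=True,                             # assumption; precision is not specified
)
```

On a single 80GB A100, the effective batch size of 16 could also be reached with `per_device_train_batch_size=16` and no gradient accumulation; the split shown here is only illustrative.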