Language Models as Science Tutors
Authors: Alexis Chevalier, Jiayi Geng, Alexander Wettig, Howard Chen, Sebastian Mizera, Toni Annala, Max Aragon, Arturo Rodriguez Fanlo, Simon Frieder, Simon Machado, Akshara Prabhakar, Ellie Thieu, Jiachen T. Wang, Zirui Wang, Xindi Wu, Mengzhou Xia, Wenhan Xia, Jiatong Yu, Junjie Zhu, Zhiyong Ren, Sanjeev Arora, Danqi Chen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | TUTOREVAL helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multidisciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TUTOREVAL. Therefore, we create TUTORCHAT, a dataset of 80,000 long synthetic dialogues about textbooks. We use TUTORCHAT to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TUTOREVAL while performing strongly on GSM8K and MATH. |
| Researcher Affiliation | Collaboration | 1Princeton Language and Intelligence, Princeton University 2School of Natural Sciences, Institute for Advanced Study 3School of Mathematics, Institute for Advanced Study 4Neuroscience Institute, Princeton University 5Hebrew University of Jerusalem 6Department of Computer Science, Oxford University 7FIM Institute for Mathematical Research, ETH Zürich 8Department of Computer Science, University of Wisconsin-Madison 9Meta FAIR 10Department of Civil and Environmental Engineering, Princeton University 11Andlinger Center for Energy and the Environment, Princeton University. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release competitive long-context models specialized in science and math reasoning, as well as all our data and evaluations at https://github.com/princeton-nlp/LM-Science-Tutor. |
| Open Datasets | Yes | Our datasets build on open-source materials, and we release our models, data, and evaluations publicly. |
| Dataset Splits | Yes | We create a validation split of 2.5K samples from TUTORCHAT. A hedged sketch of such a split appears below the table. |
| Hardware Specification | Yes | We use 16 H100 GPUs to fine-tune Llemma-7B-32K on this dataset. To fine-tune Llemma-7B-32K, we use one A100 GPU with 80GB memory. To fine-tune Llemma-34B, we use 32 H100 GPUs. (These excerpts describe different training stages reported in the paper.) |
| Software Dependencies | No | The paper mentions software such as "Flash Attention" but does not provide version numbers for the software dependencies used in its experiments. |
| Experiment Setup | Yes | We use a batch size of 512, a learning rate of 2e-5 with a 10% warm-up, and the Adam optimizer (Kingma & Ba, 2015). We always fine-tune for two epochs, with a batch size of 16, a learning rate of 1e-5, and a 10% warm-up. A hedged configuration sketch appears below the table. |
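To ground the dataset-splits row above, here is a minimal sketch of carving a 2.5K-sample validation split out of TUTORCHAT with the Hugging Face `datasets` library. The Hub path and the seed are illustrative assumptions, not taken from the paper; the authors' repository (https://github.com/princeton-nlp/LM-Science-Tutor) documents the data and splits they actually release.

```python
# Hypothetical sketch: a 2.5K-sample validation split from TUTORCHAT.
# The Hub path below is an assumption; see the authors' repository for
# the released data.
from datasets import load_dataset

dataset = load_dataset("princeton-nlp/TutorChat", split="train")  # assumed path

# `train_test_split` accepts an absolute test_size; the seed is our
# choice, not something the paper specifies.
splits = dataset.train_test_split(test_size=2500, seed=42)
train_set, val_set = splits["train"], splits["test"]

print(f"train: {len(train_set)}  validation: {len(val_set)}")
```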
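The experiment-setup row reports two hyperparameter settings; the second (two epochs, batch size 16, learning rate 1e-5, 10% warm-up, Adam) appears to describe the fine-tuning runs. Below is a sketch of how that setting maps onto a Hugging Face `TrainingArguments` object, assuming standard `transformers` training. This is not the authors' script: the output directory, precision flag, and the AdamW variant are our assumptions.

```python
# Hypothetical sketch of the fine-tuning configuration quoted above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llemma-7b-32k-tutorchat",  # assumed name
    num_train_epochs=2,                    # "we always fine-tune for two epochs"
    per_device_train_batch_size=16,        # stated batch size (single device assumed)
    learning_rate=1e-5,                    # stated learning rate
    warmup_ratio=0.1,                      # "a 10% warm-up"
    optim="adamw_torch",                   # paper cites Adam (Kingma & Ba, 2015);
                                           # we use the AdamW variant HF exposes
    bf16=True,                             # assumed mixed precision on A100/H100
)
```

The batch-size-512, learning-rate-2e-5 setting quoted in the same row belongs to a separate training stage reported in the paper and would be configured analogously.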