Adapting Large Language Models via Reading Comprehension
Authors: Daixuan Cheng, Shaohan Huang, Furu Wei
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore how continued pre-training on domain-specific corpora influences large language models, revealing that training on the raw corpora endows the model with domain knowledge, but drastically hurts its prompting ability for question answering. (...) We conduct preliminary experiments on three domains: biomedicine, finance, and law (...) Our experiments in domains such as biomedicine, finance, and law highlight the effectiveness of our approach in improving model performance on various domain-specific tasks. (...) Table 4: Domain-specific task performance of the general language models, the models that have undergone vanilla domain-adaptive pre-training (DAPT), and ours (AdaptLLM) in prompting evaluation. We also display prompting results of MedAlpaca (Han et al., 2023) in biomedicine, BloombergGPT (Wu et al., 2023b) in finance, and LexGPT (Lee, 2023) in law. (...) Table 5: Ablation results on training data. Raw Text refers to raw corpora, Read. Compre. refers to reading comprehension texts, Gen. Ins. refers to general instructions, and Raw. + Gen. Ins. and Read. + Gen. Ins. correspond to different data mixtures. We report the average of task scores in prompting evaluation within each domain. (A sketch of how such a data mixture might be assembled follows the table.) |
| Researcher Affiliation | Industry | Daixuan Cheng, Shaohan Huang & Furu Wei; Microsoft Research; Beijing Institute for General Artificial Intelligence (BIGAI) |
| Pseudocode | No | The paper describes the method in prose and provides figures illustrating examples, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our model, code, and data are available at https://github.com/microsoft/LMOps. |
| Open Datasets | Yes | PubMed Abstracts and FreeLaw Opinions from the Pile (Gao et al., 2021) are utilized as the pre-training corpora for the biomedicine and law domains, respectively. For finance, we collect financial news from May 2022 to May 2023 for over 7,000 stocks, using the FinGPT codebase (Yang et al., 2023). General instructions are sourced from LIMA (Zhou et al., 2023), WizardLM (Xu et al., 2023), and Orca (Mukherjee et al., 2023; Lian et al., 2023). (...) For biomedicine, we evaluate on PubMedQA (Jin et al., 2019), ChemProt (Kringelum et al., 2016), MQP (McCreery et al., 2020), RCT (Dernoncourt & Lee, 2017), and USMLE (Jin et al., 2020). |
| Dataset Splits | No | The paper describes its pre-training and fine-tuning processes and evaluates on various datasets. While it implicitly uses evaluation sets for validation, it does not provide specific details about how these datasets are split into distinct training, validation, and testing subsets, either by percentages, sample counts, or references to predefined validation splits. |
| Hardware Specification | Yes | Table 8: Hyper-parameters of domain-adaptive pre-training. Computing infrastructure: 32 V100-32GB GPUs |
| Software Dependencies | No | The paper mentions key software components: "Our pre-training code is based on TorchScale (Ma et al., 2022)" and "We use the SentencePiece tool (Kudo & Richardson, 2018)". However, it does not specify version numbers for these tools or any other software libraries required for reproducibility. |
| Experiment Setup | Yes | Table 8: Hyper-parameters of domain-adaptive pre-training. Number of steps: 10,000; Batch size: 32; Maximum sequence length: 2,048; Maximum learning rate: 1e-5; Optimizer: Adam; Adam beta weights: 0.9, 0.95; Learning rate scheduler: cosine; Weight decay: 0.1; Warm-up steps: 1,000; Gradient clipping: 1.0; Dropout ratio: 0.1. (A minimal training-loop sketch using these values follows the table.) |
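
To make the Table 8 setup concrete, here is a minimal PyTorch sketch of a training loop wired up with the reported hyper-parameter values. Only the numbers come from the paper: the tiny linear model, random batches, and placeholder loss are stand-ins so the sketch runs end to end. The paper lists "Adam" with weight decay 0.1, implemented here as AdamW, and the cosine decay floor of zero is an assumption (the paper only says "cosine").

```python
import math
import torch
from torch import nn

# Hyper-parameter values copied from Table 8 of the paper.
MAX_LR, BETAS, WEIGHT_DECAY = 1e-5, (0.9, 0.95), 0.1
WARMUP_STEPS, TOTAL_STEPS = 1_000, 10_000
GRAD_CLIP, BATCH_SIZE, MAX_SEQ_LEN = 1.0, 32, 2_048

model = nn.Linear(16, 16)  # stand-in for the causal LM being adapted

def lr_lambda(step: int) -> float:
    # Linear warm-up for the first 1,000 steps, then cosine decay
    # toward zero over the remaining steps (decay floor is assumed).
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

# "Adam" with weight decay 0.1 is implemented as AdamW here.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=MAX_LR, betas=BETAS, weight_decay=WEIGHT_DECAY
)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(TOTAL_STEPS):
    x = torch.randn(BATCH_SIZE, 16)   # placeholder batch
    loss = model(x).pow(2).mean()     # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```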
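The Table 5 ablations compare raw corpora, reading comprehension texts, general instructions, and their mixtures. As a rough illustration of how a "Read. + Gen. Ins." mixture might be assembled, the sketch below interleaves the two sources into one shuffled training set. The function name, the default 50/50 ratio, and the placeholder texts are all assumptions, not the paper's released pipeline; the data published at https://github.com/microsoft/LMOps is the authoritative reference.

```python
import random

def build_mixture(reading_comprehension, general_instructions,
                  general_ratio=0.5, seed=0):
    """Mix reading comprehension texts with general instructions.

    `general_ratio` is the fraction of the final mixture drawn from
    general instructions; the paper does not report an exact ratio,
    so this default is an assumption.
    """
    rng = random.Random(seed)
    n_general = int(len(reading_comprehension)
                    * general_ratio / (1.0 - general_ratio))
    mixture = list(reading_comprehension) + rng.sample(
        list(general_instructions),
        min(n_general, len(general_instructions)),
    )
    rng.shuffle(mixture)
    return mixture

# Toy usage with placeholder strings standing in for real training texts.
mix = build_mixture(
    ["<domain text + comprehension Q&A>"] * 4,
    ["<general instruction + response>"] * 4,
)
```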