Conditional Language Learning with Context

Authors: Xiao Zhang, Miao Li, Ji Wu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate knowledge learning in finetuned language models with question answering tasks, a common approach in previous work (Hendrycks et al., 2021; Singhal et al., 2023)." "Figure 5. Performance-forgetting tradeoff curve of standard finetuning and conditional finetuning on Anatomy and SQuAD (closed-book)."
Researcher Affiliation | Academia | "Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University. Correspondence to: Ji Wu <wuji_ee@mail.tsinghua.edu.cn>."
Pseudocode | No | The paper describes the method verbally and with a diagram, but does not include structured pseudocode or algorithm blocks (see the hypothetical sketch after the table).
Open Source Code | Yes | "We release our code implementation along with the original part of the data used in the paper." https://github.com/xiaozeroone/conditional_finetune
Open Datasets | Yes | "We use the medical textbooks provided with the MedQA dataset (Jin et al., 2021) as a domain corpus to finetune LLaMA-2 (Touvron et al., 2023b)"; "C4 (Raffel et al., 2020), a corpus of general web text."
Dataset Splits | No | The paper mentions evaluating perplexity on a validation split of C4, but does not explicitly define reproducible train/validation/test splits for its own finetuning data (medical textbooks, Wikipedia excerpts); see the loading snippet after the table.
Hardware Specification | Yes | "We use the Transformers library (Wolf et al., 2020) and an NVIDIA A100 GPU for the experiments."
Software Dependencies | No | The paper mentions the Transformers library (Wolf et al., 2020), the AdamW optimizer (Loshchilov & Hutter, 2019), the PEFT library (Mangrulkar et al., 2022), and EleutherAI's Language Model Evaluation Harness (Gao et al., 2021), but provides no version numbers for these software dependencies (see the version-recording snippet after the table).
Experiment Setup | Yes | "We finetune the model with the AdamW optimizer (Loshchilov & Hutter, 2019), a learning rate of 3e-5, and a batch size of 16. The maximum sequence length is set to 2048. A linear learning rate decay is used with a warm-up of 10% of the total number of steps. We use gradient clipping at 1.0." (See the TrainingArguments sketch after the table.)
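
Since the paper provides no algorithm block, the sketch below illustrates one plausible reading of "conditional finetuning": prepend a context string to each training text and mask the context tokens out of the language-modeling loss. This is a hypothetical reconstruction under that assumption, not the authors' released implementation; the checkpoint name and the helper function are assumptions.

```python
# Hypothetical reconstruction of a context-conditioned LM loss (not the authors' code).
# Assumption: the context is prepended to each training text and excluded from the
# loss by setting its label positions to -100 (ignored by Transformers' cross-entropy).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # model family used in the paper; exact checkpoint assumed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def conditional_lm_loss(context: str, text: str) -> torch.Tensor:
    """Language-modeling loss on `text` conditioned on (but not trained on) `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    txt_ids = tokenizer(text, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, txt_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.size(1)] = -100  # mask the context tokens from the loss
    return model(input_ids=input_ids, labels=labels).loss
```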
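
On dataset splits, the C4 perplexity evaluation could in principle be approximated with the public C4 validation split. The snippet below only illustrates one way to obtain such a split with the Hugging Face `datasets` library; the exact subset and split used by the authors are not specified.

```python
# Illustration only: loading a C4 validation split for perplexity evaluation.
# The paper does not state which C4 subset/split was used.
from datasets import load_dataset

c4_val = load_dataset("allenai/c4", "en", split="validation", streaming=True)
print(next(iter(c4_val))["text"][:200])
```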
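
Because no dependency versions are given, a reader reproducing the setup may at least want to record the versions installed locally; the snippet below does this with `importlib.metadata`. The package names listed are assumptions about the corresponding PyPI distributions.

```python
# The paper pins no versions, so record the ones installed locally when reproducing.
# Package names are assumptions about the corresponding PyPI distributions.
from importlib.metadata import PackageNotFoundError, version

for pkg in ["transformers", "peft", "torch", "lm-eval"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```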
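
The quoted hyperparameters map naturally onto Hugging Face `TrainingArguments`; the sketch below shows one possible configuration under that assumption. Only the values marked "quoted" come from the paper; the output directory, epoch count, and tokenization step are placeholders.

```python
# Sketch: the reported hyperparameters expressed as Hugging Face TrainingArguments.
# Values marked "quoted" come from the paper; everything else is a placeholder assumption.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./conditional_finetune_out",  # placeholder
    optim="adamw_torch",                      # AdamW optimizer (quoted)
    learning_rate=3e-5,                       # quoted
    per_device_train_batch_size=16,           # quoted batch size
    lr_scheduler_type="linear",               # linear decay (quoted)
    warmup_ratio=0.1,                         # warm-up of 10% of total steps (quoted)
    max_grad_norm=1.0,                        # gradient clipping at 1.0 (quoted)
    num_train_epochs=1,                       # not quoted; placeholder
)
# The 2048-token maximum sequence length is applied at tokenization time, e.g.
# tokenizer(..., truncation=True, max_length=2048).
```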