Can Models Learn Skill Composition from Examples?

Authors: Haoyu Zhao, Simran Kaur, Dingli Yu, Anirudh Goyal, Sanjeev Arora

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we employ a setup akin to SKILL-MIX to evaluate the capacity of smaller models to learn compositional generalization from examples. Utilizing a diverse set of language skills (including rhetorical, literary, reasoning, theory of mind, and common sense), GPT-4 was used to generate text samples that exhibit random subsets of k skills. Subsequent fine-tuning of 7B and 13B parameter models on these combined-skill texts, for increasing values of k, revealed the following finding: training on combinations of k = 2 and 3 skills results in noticeable improvements in the ability to compose texts with k = 4 and 5 skills, despite the models never having seen such examples during training. (An illustrative data-generation sketch follows the table.)
Researcher Affiliation | Collaboration | Haoyu Zhao (1,2), Simran Kaur (1,2), Dingli Yu (1,2), Anirudh Goyal (3), Sanjeev Arora (1,2). (1) Department of Computer Science, Princeton University; (2) Princeton Language and Intelligence (PLI), Princeton University; (3) Meta. {haoyu,arora}@cs.princeton.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in prose.
Open Source Code | No | We have provided details that should be enough for reproducing the results, and we will provide the codebase in the final version.
Open Datasets | No | This dataset consists of 13,957 text pieces, each composed of k randomly selected skills with k = 1, 2, 3. We evaluate the capability of the fine-tuned models to combine another, held-out set of skills with potentially higher k.
Dataset Splits | No | We evaluate the SKILL-MIX(k) performance (k = 2, 3, 4, 5) for all the models fine-tuned on data generated in Section 3.1, i.e., D_SKILL-MIX(1), D_SKILL-MIX(2), and D_SKILL-MIX(3). ... I. SKILL-MIX evaluation on training skills and topics. ... II. SKILL-MIX evaluation on held-out skills and topics.
Hardware Specification | Yes | All fine-tuning experiments are conducted on 4 Nvidia H100/A100 GPUs.
Software Dependencies | No | The paper mentions optimizers and specific models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | We fine-tune LLaMA-2-13B-Chat [28] and Mistral-7B-Instruct-v0.2 [16] on the data generated in Section 3.1 for 4000 steps with a batch size of 64. ... We use Adam as the optimizer and linear warmup for the first 64 steps, followed by a constant learning rate of 2e-5 for the remaining training steps. The maximum token length is set as 1024. (A hedged configuration sketch follows the table.)
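
To make the data-generation setup described in the Research Type row concrete, below is a minimal illustrative sketch of how random subsets of k skills could be sampled and turned into GPT-4 generation prompts. The skill list, topic list, and prompt wording are placeholders for this sketch, not the authors' actual lists or templates.

    # Minimal sketch of SKILL-MIX-style data generation (illustrative only).
    import random

    SKILLS = ["metaphor", "red herring", "modus ponens", "folk physics"]  # placeholder skills
    TOPICS = ["gardening", "dueling"]                                     # placeholder topics

    def make_prompt(k: int) -> str:
        """Sample k random skills and one topic, and build a generation prompt."""
        skills = random.sample(SKILLS, k)
        topic = random.choice(TOPICS)
        return (
            f"Write a short piece of text about {topic} that naturally "
            f"illustrates all of the following language skills: {', '.join(skills)}."
        )

    # One such prompt per training example, for k in {1, 2, 3}, would yield
    # a fine-tuning set like the ~14k-piece dataset reported above.
    print(make_prompt(k=3))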
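
The Experiment Setup row lists the reported hyperparameters (4000 steps, batch size 64, Adam, 64 warmup steps, constant learning rate 2e-5, maximum length 1024 tokens). The sketch below expresses them as a Hugging Face TrainingArguments configuration; the paper does not state which training framework was used, so this is an assumed reconstruction rather than the authors' code, and the per-device batch split is likewise an assumption.

    # Hedged sketch of the reported fine-tuning configuration (assumed framework:
    # Hugging Face transformers Trainer).
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

    MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # or the LLaMA-2-13B-Chat checkpoint

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    args = TrainingArguments(
        output_dir="skillmix-finetune",
        max_steps=4000,                           # 4000 training steps
        per_device_train_batch_size=16,           # assumption: 16 x 4 GPUs = batch size 64
        learning_rate=2e-5,                       # constant LR after warmup
        lr_scheduler_type="constant_with_warmup",
        warmup_steps=64,                          # linear warmup for the first 64 steps
        optim="adamw_torch",                      # Adam-family optimizer
        # bf16=True,                              # assumption: mixed precision on H100/A100
    )

    # The GPT-4-generated SKILL-MIX texts would be tokenized with truncation
    # to the reported maximum length of 1024 tokens:
    def tokenize(example):
        return tokenizer(example["text"], truncation=True, max_length=1024)

    # trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    # trainer.train()

Splitting the batch as 16 examples per device across the 4 reported GPUs is one possible reading; gradient accumulation could equally be used to reach the effective batch size of 64.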