Can Models Learn Skill Composition from Examples?
Authors: Haoyu Zhao, Simran Kaur, Dingli Yu, Anirudh Goyal, Sanjeev Arora
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we employ a setup akin to SKILL-MIX to evaluate the capacity of smaller models to learn compositional generalization from examples. Utilizing a diverse set of language skills (including rhetorical, literary, reasoning, theory of mind, and common sense), GPT-4 was used to generate text samples that exhibit random subsets of k skills. Subsequent fine-tuning of 7B and 13B parameter models on these combined-skill texts, for increasing values of k, revealed the following finding: training on combinations of k = 2 and 3 skills results in noticeable improvements in the ability to compose texts with k = 4 and 5 skills, despite the models never having seen such examples during training. (A sketch of this skill-subset sampling appears after the table.) |
| Researcher Affiliation | Collaboration | Haoyu Zhao1,2 Simran Kaur1,2 Dingli Yu1,2 Anirudh Goyal3 Sanjeev Arora1,2 1 Department of Computer Science, Princeton University 2 Princeton Language and Intelligence (PLI), Princeton University 3 Meta {haoyu,arora}@cs.princeton.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in prose. |
| Open Source Code | No | We have provided details that should be enough for reproducing the results, and we will provide the codebase in the final version. |
| Open Datasets | No | This dataset consists of 13,957 text pieces, each composed of k randomly selected skills with k = 1, 2, 3. We evaluate the capability of the fine-tuned models to combine another set of held-out skills with potentially higher k. |
| Dataset Splits | No | We evaluate the SKILL-MIX(k) performance (k = 2, 3, 4, 5) for all the models fine-tuned on data generated in Section 3.1, i.e., D_SKILL-MIX(1), D_SKILL-MIX(2), and D_SKILL-MIX(3). ... I. SKILL-MIX evaluation on training skills and topics. ... II. SKILL-MIX on held-out skills and topics. |
| Hardware Specification | Yes | All fine-tuning experiments are conducted on 4 Nvidia H100/A100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers and specific models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We fine-tune LLaMA-2-13B-Chat [28] and Mistral-7B-Instruct-v0.2 [16] on the data generated in Section 3.1 for 4000 steps with a batch size of 64. ... We use Adam as the optimizer and linear warmup for the first 64 steps, followed by a constant learning rate of 2e-5 for the remaining training steps. The maximum token length is set as 1024. (A hedged configuration sketch appears after the table.) |
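
The data-generation protocol quoted in the Research Type and Open Datasets rows amounts to sampling a random subset of k skills plus a topic and asking GPT-4 to write a text that exhibits all of them. The minimal Python sketch below illustrates that sampling step only; the skill and topic names, the prompt wording, and the helper `make_generation_prompt` are illustrative placeholders, not the paper's exact prompt or skill list.

```python
# Illustrative sketch of the SKILL-MIX-style data generation step: pick a random
# subset of k skills and a topic, then ask a strong model (GPT-4 in the paper)
# to write a short text that exhibits all of them. Skill/topic names and the
# prompt wording here are assumptions for illustration.
import random

SKILLS = ["metaphor", "red herring", "modus ponens", "folk physics", "self-serving bias"]
TOPICS = ["gardening", "dueling", "sewing"]

def make_generation_prompt(k: int, rng: random.Random) -> str:
    skills = rng.sample(SKILLS, k)   # random subset of k distinct skills
    topic = rng.choice(TOPICS)
    return (
        f"Write a short, coherent piece of text about {topic} that illustrates "
        f"all of the following skills: {', '.join(skills)}."
    )

rng = random.Random(0)
for k in (1, 2, 3):                  # the training data uses k = 1, 2, 3
    print(make_generation_prompt(k, rng))
```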
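For the Experiment Setup row, the reported hyperparameters (4000 steps, global batch size 64 across 4 GPUs, Adam, linear warmup for 64 steps, a constant learning rate of 2e-5 afterwards, and a maximum token length of 1024) can be written down as a training configuration. The sketch below uses Hugging Face transformers' `TrainingArguments` only as an assumed framework; the paper does not state which training stack was used, and the per-device batch split, the AdamW variant, and the bf16 setting are assumptions.

```python
# Hypothetical fine-tuning configuration matching the hyperparameters reported
# in the paper. Only the numeric values are taken from the text; the framework
# choice and the per-device split of the batch size are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="skillmix-finetune",      # placeholder path
    max_steps=4000,                      # "4000 steps"
    per_device_train_batch_size=16,      # 16 x 4 GPUs = global batch size 64 (split is an assumption)
    learning_rate=2e-5,                  # constant LR of 2e-5 after warmup
    warmup_steps=64,                     # linear warmup for the first 64 steps
    lr_scheduler_type="constant_with_warmup",
    optim="adamw_torch",                 # paper says "Adam"; AdamW is the closest built-in option here
    bf16=True,                           # assumption: mixed precision on H100/A100
    logging_steps=50,
)

# The maximum token length of 1024 would be enforced at tokenization time, e.g.:
# tokenizer(batch["text"], truncation=True, max_length=1024)
```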