Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Can Models Learn Skill Composition from Examples?
Authors: Haoyu Zhao, Simran Kaur, Dingli Yu, Anirudh Goyal, Sanjeev Arora
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we employ a setup akin to SKILL-MIX to evaluate the capacity of smaller models to learn compositional generalization from examples. Utilizing a diverse set of language skills (including rhetorical, literary, reasoning, theory of mind, and common sense), GPT-4 was used to generate text samples that exhibit random subsets of k skills. Subsequent fine-tuning of 7B and 13B parameter models on these combined skill texts, for increasing values of k, revealed the following findings: Training on combinations of k = 2 and 3 skills results in noticeable improvements in the ability to compose texts with k = 4 and 5 skills, despite models never having seen such examples during training. |
| Researcher Affiliation | Collaboration | Haoyu Zhao (1,2), Simran Kaur (1,2), Dingli Yu (1,2), Anirudh Goyal (3), Sanjeev Arora (1,2). (1) Department of Computer Science, Princeton University; (2) Princeton Language and Intelligence (PLI), Princeton University; (3) Meta. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. The methods are described in prose. |
| Open Source Code | No | We have provided details that should be enough for reproducing the results, and we will provide the codebase in the final version. |
| Open Datasets | No | This dataset consists of 13,957 text pieces, each composed of randomly selected k skills with k = 1, 2, 3. We evaluate the capability of the fine-tuned models to combine another set of held-out skills with potentially higher k. |
| Dataset Splits | No | We evaluate the SKILL-MIX(k) performance (k = 2, 3, 4, 5) for all the models fine-tuned on data generated in Section 3.1, i.e., D_SKILL-MIX(1), D_SKILL-MIX(2), and D_SKILL-MIX(3). ... I. SKILL-MIX evaluation on training skills and topics. ... II. SKILL-MIX on held-out skills and topics. |
| Hardware Specification | Yes | All fine-tuning experiments are conducted on 4 Nvidia H100/A100 GPUs. |
| Software Dependencies | No | The paper mentions optimizers and specific models but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We fine-tune LLaMA-2-13B-Chat [28] and Mistral-7B-Instruct-v0.2 [16] on the data generated in Section 3.1 for 4000 steps with a batch size of 64. ... We use Adam as the optimizer and linear warmup for the first 64 steps, followed by a constant learning rate of 2e-5 for the remaining training steps. The maximum token length is set as 1024. |
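
The learning-rate schedule quoted under Experiment Setup can be sketched as a small function. This is a minimal illustration, not the authors' code: the step counts, batch size, and peak learning rate come from the quoted text, while the function name, the 1-indexed step convention, and the assumption that warmup rises linearly from near zero to the peak are hypothetical choices made here.

```python
# Values taken from the quoted experiment setup.
WARMUP_STEPS = 64    # linear warmup for the first 64 steps
TOTAL_STEPS = 4000   # total fine-tuning steps
PEAK_LR = 2e-5       # constant learning rate after warmup
BATCH_SIZE = 64      # sequences per optimizer step


def learning_rate(step: int) -> float:
    """Return the learning rate at a 1-indexed optimizer step.

    Assumption (not stated in the source): warmup scales linearly from
    PEAK_LR / WARMUP_STEPS at step 1 up to PEAK_LR at step WARMUP_STEPS.
    """
    if not 1 <= step <= TOTAL_STEPS:
        raise ValueError(f"step must be in [1, {TOTAL_STEPS}], got {step}")
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR


# Number of training sequences processed over the whole run.
sequences_seen = TOTAL_STEPS * BATCH_SIZE

if __name__ == "__main__":
    for s in (1, 32, 64, 65, 4000):
        print(f"step {s:>4}: lr = {learning_rate(s):.3e}")
    print(f"sequences seen: {sequences_seen}")
```

Under these assumptions the run processes 4000 × 64 = 256,000 training sequences, each truncated to at most 1024 tokens.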