SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models
Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work introduces SKILL-MIX, a new evaluation to measure the ability to combine skills. Administering a version of SKILL-MIX to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Section 5 shows results of administering SKILL-MIX(k) for different k to today's leading models. We evaluate various instruction-tuned models on SKILL-MIX(k) with k = 2, 3, 4. |
| Researcher Affiliation | Collaboration | Dingli Yu¹, Simran Kaur¹, Arushi Gupta¹, Jonah Brown-Cohen², Anirudh Goyal², Sanjeev Arora¹; ¹Princeton Language and Intelligence (PLI), Princeton University; ²Google DeepMind |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | No | The paper mentions maintaining a leaderboard at skill-mix.github.io, but it does not state that the source code for the methodology or its implementation is publicly available. |
| Open Datasets | Yes | Using the RedPajama dataset (Computer, 2023), we identified skills that have a frequency of at least 5% and removed all 17 such skills. |
| Dataset Splits | No | The paper does not explicitly describe standard training/validation/test dataset splits for the SKILL-MIX evaluation itself or for the fine-tuning conducted. |
| Hardware Specification | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens. |
| Software Dependencies | No | The paper mentions using the NLTK package and models like GPT-4, LLaMA-2, and others, but it does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens. For the GPT family, we use the OpenAI API with the default generation configuration and the minimal system prompt "You are a helpful assistant." We do not use quantization on any of the models. Training details: recall that the generation contains two rounds of conversation; we feed both rounds to LLaMA-2-7B-Chat and fine-tune only on the tokens of the GPT-4 output. We use LoRA (Hu et al., 2021), an exponentially decaying learning rate starting at 1 × 10⁻⁴, and fine-tune for 3 epochs. (Illustrative sketches of these settings follow the table.) |
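
The generation settings quoted in the Hardware Specification and Experiment Setup rows (no system prompt, 0.7 temperature, 1.0 repetition penalty, 512 max new tokens, no quantization, 2 A100 GPUs) can be reproduced with Hugging Face `transformers`. The sketch below is a minimal illustration, not the authors' released code; the checkpoint name and the prompt string are placeholders.

```python
# Minimal sketch (not the authors' code) of the reported LLaMA-2 generation settings:
# no system prompt, temperature 0.7, repetition penalty 1.0, 512 max new tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # placeholder; the paper evaluates several LLaMA-2 chat sizes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half-precision weights; the paper only states that no quantization is used
    device_map="auto",          # the paper reports running on 2 A100 GPUs
)

# No system prompt is used for the LLaMA-2 family; the instruction text is a placeholder.
prompt = "[INST] <SKILL-MIX prompt asking for a short piece of text combining k skills on a topic> [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.7,         # as stated in the setup
    repetition_penalty=1.0,  # i.e. effectively no repetition penalty
    max_new_tokens=512,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

The GPT family, by contrast, is queried through the OpenAI API with its default generation configuration, so only the locally hosted LLaMA-2 models need explicit sampling settings like these.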
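The fine-tuning recipe in the Experiment Setup row (LoRA on LLaMA-2-7B-Chat, an exponentially decaying learning rate starting at 1 × 10⁻⁴, 3 epochs, loss restricted to the GPT-4 output tokens) can be approximated with the `peft` library. The sketch below is an assumption-laden illustration: the LoRA rank, alpha, dropout, target modules, decay factor, and the `build_example` helper are not specified in the paper.

```python
# Minimal sketch (assumptions noted) of the described LoRA fine-tuning:
# LLaMA-2-7B-Chat, exponentially decaying LR starting at 1e-4, 3 epochs,
# loss computed only on the GPT-4 output tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")

lora_cfg = LoraConfig(
    r=8,                                  # assumed rank
    lora_alpha=16,                        # assumed scaling
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

def build_example(conversation_rounds: str, gpt4_output: str) -> dict:
    """Concatenate both conversation rounds with the GPT-4 answer and
    supervise only the answer tokens (prompt labels set to -100)."""
    prompt_ids = tokenizer(conversation_rounds, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(gpt4_output, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + target_ids,
        "labels": [-100] * len(prompt_ids) + target_ids,  # -100 tokens are ignored by the LM loss
    }

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),  # only the LoRA parameters are trainable
    lr=1e-4,                                             # starting LR from the paper
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # decay factor assumed
# A standard training loop would run for 3 epochs, calling optimizer.step() per batch
# and scheduler.step() once per epoch (or per step, depending on the chosen schedule).
```

Masking prompt tokens with -100 is the standard way to restrict the causal-LM loss to the target span, which matches the paper's statement that fine-tuning is done only on tokens of the GPT-4 output.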