SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models

Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work introduces SKILL-MIX, a new evaluation to measure the ability to combine skills. Administering a version of SKILL-MIX to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Section 5 shows results of administering SKILL-MIX(k) for different k to today's leading models. We evaluate various instruction-tuned models on SKILL-MIX(k) with k = 2, 3, 4.
Researcher Affiliation | Collaboration | Dingli Yu (1), Simran Kaur (1), Arushi Gupta (1), Jonah Brown-Cohen (2), Anirudh Goyal (2), Sanjeev Arora (1); (1) Princeton Language and Intelligence (PLI), Princeton University; (2) Google DeepMind
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | No | The paper mentions maintaining a leaderboard at skill-mix.github.io, but it does not state that the source code for the methodology or its implementation is publicly available.
Open Datasets | Yes | Using the Red Pajama dataset (Computer, 2023), we identified skills that have a frequency of at least 5% and removed all 17 such skills.
Dataset Splits | No | The paper does not explicitly describe standard training/validation/test dataset splits for the SKILL-MIX evaluation itself or for the fine-tuning conducted.
Hardware Specification | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens.
Software Dependencies | No | The paper mentions using the NLTK package and models like GPT-4, LLaMA-2, and others, but it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens. For the GPT family, we use the OpenAI API with the default generation configuration and the minimal system prompt "You are a helpful assistant." We do not use quantization on any of the models. Training details: recall that generation involves two rounds of conversation; we feed both rounds to LLaMA-2-7B-Chat and fine-tune only on the tokens of the GPT-4 output. We use LoRA (Hu et al., 2021) with an exponentially decaying learning rate starting at 1e-4 and fine-tune for 3 epochs.
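The decoding settings reported in the Experiment Setup row can be collected into a small configuration sketch. The parameter names below follow Hugging Face transformers conventions, which is an assumption on our part: the paper reports only the values (0.7 temperature, 1.0 repetition penalty, 512 max new tokens), not the inference API used.

```python
# Hedged sketch of the reported decoding settings for the LLaMA-2 family.
# Parameter names are assumed (Hugging Face-style); the paper gives only values.
llama2_generation_config = {
    "do_sample": True,          # sampling is implied by the non-default temperature
    "temperature": 0.7,
    "repetition_penalty": 1.0,  # i.e. effectively no repetition penalty
    "max_new_tokens": 512,
}

# The GPT family is queried through the OpenAI API with default generation
# settings; only this minimal system prompt (quoted from the paper) is set.
gpt_system_prompt = "You are a helpful assistant."
```

Note that a repetition penalty of 1.0 is the identity value, so the LLaMA-2 runs are plain temperature sampling.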
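The fine-tuning recipe (LoRA, an exponentially decaying learning rate starting at 1e-4, 3 epochs) can be sketched as a simple schedule. The starting rate matches the paper; the per-epoch decay factor `gamma` and the epoch-level granularity are hypothetical placeholders, since the paper does not report them.

```python
def exp_decay_lr(epoch: int, lr0: float = 1e-4, gamma: float = 0.1) -> float:
    """Exponentially decaying learning rate: lr0 * gamma**epoch.

    lr0 = 1e-4 matches the paper; gamma = 0.1 is an assumed decay factor
    (the paper does not specify the decay rate or schedule granularity).
    """
    return lr0 * gamma ** epoch

# Learning rate at the start of each of the 3 reported fine-tuning epochs.
schedule = [exp_decay_lr(e) for e in range(3)]
```

In practice this kind of schedule would be driven by an optimizer's LR scheduler rather than computed by hand; the function above only makes the reported numbers concrete.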