SKILL-MIX: a Flexible and Expandable Family of Evaluations for AI Models

Authors: Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, Sanjeev Arora

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This work introduces SKILL-MIX, a new evaluation to measure the ability to combine skills. Administering a version of SKILL-MIX to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. Section 5 shows results of administering SKILL-MIX(k) for different k to today's leading models. We evaluate various instruction-tuned models on SKILL-MIX(k) with k = 2, 3, 4.
Researcher Affiliation | Collaboration | Dingli Yu (1), Simran Kaur (1), Arushi Gupta (1), Jonah Brown-Cohen (2), Anirudh Goyal (2), Sanjeev Arora (1); (1) Princeton Language and Intelligence (PLI), Princeton University; (2) Google DeepMind
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | No | The paper mentions maintaining a leaderboard at skill-mix.github.io, but it does not state that the source code for the methodology or its implementation is publicly available.
Open Datasets | Yes | Using the Red Pajama dataset (Computer, 2023), we identified skills that have a frequency of at least 5% and removed all 17 such skills.
Dataset Splits | No | The paper does not explicitly describe standard training/validation/test dataset splits for the SKILL-MIX evaluation itself or for the fine-tuning conducted.
Hardware Specification | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens.
Software Dependencies | No | The paper mentions using the NLTK package and models like GPT-4, LLaMA-2, and others, but it does not provide specific version numbers for these software components or libraries.
Experiment Setup | Yes | For the LLaMA-2 family, we use 2 A100 GPUs and run with no system prompt, 0.7 temperature, 1.0 repetition penalty, and 512 max new tokens. For the GPT family, we use the OpenAI API with the default generation configuration and the minimal system prompt "You are a helpful assistant." We do not use quantization on any of the models. Training details: recall that generation involves two rounds of conversation; we feed both rounds to LLaMA-2-7B-Chat and fine-tune only on the tokens of the GPT-4 output. We use LoRA (Hu et al., 2021) with an exponentially decaying learning rate starting at 1e-4 and fine-tune for 3 epochs.
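The decoding settings reported in the Experiment Setup row can be collected into a small configuration sketch. The parameter names below follow Hugging Face transformers conventions, which is an assumption on our part: the paper reports only the values (0.7 temperature, 1.0 repetition penalty, 512 max new tokens), not the inference API used.

```python
# Hedged sketch of the reported decoding settings for the LLaMA-2 family.
# Parameter names are assumed (Hugging Face-style); the paper gives only values.
llama2_generation_config = {
    "do_sample": True,          # sampling is implied by the non-default temperature
    "temperature": 0.7,
    "repetition_penalty": 1.0,  # i.e. effectively no repetition penalty
    "max_new_tokens": 512,
}

# The GPT family is queried through the OpenAI API with default generation
# settings; only this minimal system prompt (quoted from the paper) is set.
gpt_system_prompt = "You are a helpful assistant."
```

Note that a repetition penalty of 1.0 is the identity value, so the LLaMA-2 runs are plain temperature sampling.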
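The fine-tuning recipe (LoRA, an exponentially decaying learning rate starting at 1e-4, 3 epochs) can be sketched as a simple schedule. The starting rate matches the paper; the per-epoch decay factor `gamma` and the epoch-level granularity are hypothetical placeholders, since the paper does not report them.

```python
def exp_decay_lr(epoch: int, lr0: float = 1e-4, gamma: float = 0.1) -> float:
    """Exponentially decaying learning rate: lr0 * gamma**epoch.

    lr0 = 1e-4 matches the paper; gamma = 0.1 is an assumed decay factor
    (the paper does not specify the decay rate or schedule granularity).
    """
    return lr0 * gamma ** epoch

# Learning rate at the start of each of the 3 reported fine-tuning epochs.
schedule = [exp_decay_lr(e) for e in range(3)]
```

In practice this kind of schedule would be driven by an optimizer's LR scheduler rather than computed by hand; the function above only makes the reported numbers concrete.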