Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning

Authors: Shangding Gu, Alois Knoll, Ming Jin

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experiments: 4.1 Enhancement of Instruction Diversity, 4.2 Compare with WizardLM-7b on ARC and HellaSwag Benchmarks, 4.3 Experiments of Model Privacy Attack, 4.4 Comparison Experiments on AlpacaEval Benchmarks, 4.6 Comparison Experiments of Solving a Math Problem, 4.8 Ablation Experiments Regarding Data Size
Researcher Affiliation | Academia | Shangding Gu EMAIL UC Berkeley & Technical University of Munich; Alois Knoll EMAIL Technical University of Munich; Ming Jin EMAIL Virginia Tech
Pseudocode | Yes | Algorithm 1: The Pipeline of Training LLMs (Page 5); Algorithm 2: TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via RL (Page 9); Algorithm 3: TRPO (Schulman et al., 2015) (Page 9)
Open Source Code | Yes | Code is available at the link: https://github.com/SafeRL-Lab/TeaMs-RL
Open Datasets | Yes | "Specifically, we use the initial instructions from the Alpaca dataset. The fine-tuning process is executed as follows: We first employ the Alpaca dataset as the initial set of instructions and input the initial instructions into the expert LLM like ChatGPT." (Footnote 4: https://huggingface.co/datasets/tatsu-lab/alpaca)
Dataset Splits | No | The paper trains a llama-1-7b model, denoted TeaMs-RL-7b-v1.1, on a generated dataset of 17,878 instruction-response pairs, but does not specify train/validation/test splits for this dataset. It only gives few-shot settings for external benchmarks: "In our comparison experiments, we take the same settings: 25 shots for ARC, 10 shots for HellaSwag."
Hardware Specification | Yes | "less than 1 hour on 2 NVIDIA RTX A6000 GPUs" (Section 4.1) and "the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs" (Section 4.2)
Software Dependencies | No | The paper names the "WizardLM-13b model" and "ChatGPT-3.5 and ChatGPT-4" as expert LLMs, and "Llama-1-chat-7b and Llama-2-chat-7b" as base models, but it does not state version numbers for the underlying software stack (e.g., Python, PyTorch, or CUDA) used for implementation.
Experiment Setup | Yes | Table 5: The key hyper-parameters for TRPO (Page 10) and Table 6: The key hyper-parameters for SFT (Page 11)
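Because the paper reports a 17,878-pair generated dataset but no train/validation/test split (the Dataset Splits row above), anyone reproducing the fine-tuning run must choose a split themselves. The sketch below shows one conventional seeded shuffle-and-split; the 5% validation fraction, the seed, and the `split_dataset` helper are illustrative assumptions, not values from the paper.

```python
import random

def split_dataset(pairs, val_frac=0.05, seed=0):
    """Shuffle instruction-response pairs with a fixed seed, then hold out
    a validation fraction. val_frac=0.05 is an assumed choice; the paper
    does not specify any split for its 17,878-pair dataset."""
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    n_val = int(len(pairs) * val_frac)
    val = [pairs[i] for i in idx[:n_val]]
    train = [pairs[i] for i in idx[n_val:]]
    return train, val

# Stand-in for the 17,878 generated instruction-response pairs.
pairs = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(17878)]
train, val = split_dataset(pairs)
print(len(train), len(val))  # 16985 893
```

Fixing the seed makes the split reproducible across runs, which matters precisely because the original split (if any) is unrecoverable from the paper.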