Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning

Authors: Shangding Gu, Alois Knoll, Ming Jin

TMLR 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | 4 Experiments, 4.1 Enhancement of Instruction Diversity, 4.2 Compare with WizardLM-7b on ARC and HellaSwag Benchmarks, 4.3 Experiments of Model Privacy Attack, 4.4 Comparison Experiments on AlpacaEval Benchmarks, 4.6 Comparison Experiments of Solving a Math Problem, 4.8 Ablation Experiments Regarding Data size |
| Researcher Affiliation | Academia | Shangding Gu EMAIL UC Berkeley & Technical University of Munich, Alois Knoll EMAIL Technical University of Munich, Ming Jin EMAIL Virginia Tech |
| Pseudocode | Yes | Algorithm 1 The Pipeline of training LLMs. (Page 5), Algorithm 2 TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via RL. (Page 9), Algorithm 3 TRPO (Schulman et al., 2015) (Page 9) |
| Open Source Code | Yes | Code is available at the link: https://github.com/SafeRL-Lab/TeaMs-RL |
| Open Datasets | Yes | Specifically, we use the initial instructions from the Alpaca dataset. The fine-tuning process is executed as follows: We first employ the Alpaca dataset[4] as the initial set of instructions and input the initial instructions into the expert LLM like ChatGPT. (Footnote 4: https://huggingface.co/datasets/tatsu-lab/alpaca) |
| Dataset Splits | No | "We trained a llama-1-7b model denoted TeaMs-RL-7b-v1.1 with our dataset of 17,878 instruction-response pairs, the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs." The paper does not specify train/validation/test splits for this generated dataset. It mentions few-shot settings for external benchmarks: "In our comparison experiments, we take the same settings: 25 shots for ARC, 10 shots for HellaSwag." |
| Hardware Specification | Yes | Less than 1 hour on 2 NVIDIA RTX A6000 GPUs (Section 4.1), and the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs (Section 4.2). |
| Software Dependencies | No | The paper mentions using "WizardLM-13b model" and "ChatGPT-3.5 and ChatGPT-4" as expert LLMs, and "Llama-1-chat-7b and Llama-2-chat-7b" as base models. However, it does not state version numbers for the underlying software, such as Python, PyTorch, or CUDA, used for implementation. |
| Experiment Setup | Yes | Table 5: The key hyper-parameters for TRPO. (Page 10) and Table 6: The key hyper-parameters for SFT. (Page 11) |