Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning
Authors: Shangding Gu, Alois Knoll, Ming Jin
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments, 4.1 Enhancement of Instruction Diversity, 4.2 Compare with WizardLM-7b on ARC and HellaSwag Benchmarks, 4.3 Experiments of Model Privacy Attack, 4.4 Comparison Experiments on AlpacaEval Benchmarks, 4.6 Comparison Experiments of Solving a Math Problem, 4.8 Ablation Experiments Regarding Data size |
| Researcher Affiliation | Academia | Shangding Gu (UC Berkeley & Technical University of Munich), Alois Knoll (Technical University of Munich), Ming Jin (Virginia Tech) |
| Pseudocode | Yes | Algorithm 1 The Pipeline of training LLMs. (Page 5), Algorithm 2 TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via RL. (Page 9), Algorithm 3 TRPO (Schulman et al., 2015) (Page 9) |
| Open Source Code | Yes | Code is available at the link: https://github.com/SafeRL-Lab/TeaMs-RL |
| Open Datasets | Yes | Specifically, we use the initial instructions from the Alpaca dataset. The fine-tuning process is executed as follows: We first employ the Alpaca dataset [footnote 4] as the initial set of instructions and input the initial instructions into the expert LLM like ChatGPT. (Footnote 4: https://huggingface.co/datasets/tatsu-lab/alpaca) |
| Dataset Splits | No | We trained a llama-1-7b model denoted TeaMs-RL-7b-v1.1 with our dataset of 17,878 instruction-response pairs, the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs. The paper does not specify train/validation/test splits for this generated dataset. It does state the few-shot settings used for external benchmarks: "In our comparison experiments, we take the same settings: 25 shots for ARC, 10 shots for HellaSwag." |
| Hardware Specification | Yes | less than 1 hour on 2 NVIDIA RTX A6000 GPUs (Section 4.1) and the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs (Section 4.2). |
| Software Dependencies | No | The paper mentions using "WizardLM-13b model" and "ChatGPT-3.5 and ChatGPT-4" as expert LLMs, and "Llama-1-chat-7b and Llama-2-chat-7b" as base models. However, it does not explicitly state specific version numbers for underlying software libraries like Python, PyTorch, or CUDA used for implementation. |
| Experiment Setup | Yes | Table 5: The key hyper-parameters for TRPO. (Page 10) and Table 6: The key hyper-parameters for SFT. (Page 11) |