Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via Reinforcement Learning
Authors: Shangding Gu, Alois Knoll, Ming Jin
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments, 4.1 Enhancement of Instruction Diversity, 4.2 Compare with WizardLM-7b on ARC and HellaSwag Benchmarks, 4.3 Experiments of Model Privacy Attack, 4.4 Comparison Experiments on AlpacaEval Benchmarks, 4.6 Comparison Experiments of Solving a Math Problem, 4.8 Ablation Experiments Regarding Data Size |
| Researcher Affiliation | Academia | Shangding Gu EMAIL UC Berkeley & Technical University of Munich, Alois Knoll EMAIL Technical University of Munich, Ming Jin EMAIL Virginia Tech |
| Pseudocode | Yes | Algorithm 1 The Pipeline of Training LLMs (Page 5), Algorithm 2 TeaMs-RL: Teaching LLMs to Generate Better Instruction Datasets via RL (Page 9), Algorithm 3 TRPO (Schulman et al., 2015) (Page 9) |
| Open Source Code | Yes | Code is available at the link: https://github.com/SafeRL-Lab/TeaMs-RL |
| Open Datasets | Yes | Specifically, we use the initial instructions from the Alpaca dataset. The fine-tuning process is executed as follows: We first employ the Alpaca dataset as the initial set of instructions and input the initial instructions into the expert LLM like ChatGPT. (Footnote 4: https://huggingface.co/datasets/tatsu-lab/alpaca) |
| Dataset Splits | No | We trained a llama-1-7b model, denoted TeaMs-RL-7b-v1.1, with our dataset of 17,878 instruction-response pairs; the training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs. The paper does not specify train/validation/test splits for this generated dataset. It mentions few-shot settings only for external benchmarks: "In our comparison experiments, we take the same settings: 25 shots for ARC, 10 shots for HellaSwag." |
| Hardware Specification | Yes | Less than 1 hour on 2 NVIDIA RTX A6000 GPUs (Section 4.1); training time is about 2 hours on 4 NVIDIA RTX A6000 GPUs (Section 4.2). |
| Software Dependencies | No | The paper mentions using the "WizardLM-13b model" and "ChatGPT-3.5 and ChatGPT-4" as expert LLMs, and "Llama-1-chat-7b and Llama-2-chat-7b" as base models. However, it does not explicitly state version numbers for underlying software libraries such as Python, PyTorch, or CUDA used for implementation. |
| Experiment Setup | Yes | Table 5: The key hyper-parameters for TRPO (Page 10) and Table 6: The key hyper-parameters for SFT (Page 11). |