reproducibilityindex.ai

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang

ICLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. The results are presented in Table 3.
Researcher Affiliation	Academia	UC Berkeley; UC San Diego; Carnegie Mellon University; Stanford; MBZUAI
Pseudocode	No	The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m. The code for this website is publicly available (footnote 3: https://github.com/lm-sys/FastChat/tree/v0.2.26#serving-with-web-gui).
Open Datasets	Yes	The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m. LMSYS-Chat-1M is collected on our website from April to August 2023.
Dataset Splits	No	The paper describes data selection for training and evaluation sets for specific tasks, but does not provide specific train/validation/test splits or percentages for any of its models' training processes that would allow direct reproduction of the data partitioning.
Hardware Specification	Yes	We utilize dozens of A100 GPUs to host our website, serving a total of 25 models over the course of the timespan.
Software Dependencies	Yes	The text-moderation-latest (006) is the latest OpenAI moderation API (OpenAI, 2023b) introduced on 2023/8/25.
Experiment Setup	Yes	Instead of developing a classifier, we fine-tune a language model to generate explanations for why a particular message was flagged, based on the system prompt described in the moderation task (see Appendix B.2). The detailed system prompt and few-shot examples can be found in Appendix B.7.
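
The Dataset Splits row notes that the paper gives no train/validation/test partition. As a rough sketch only, a reader who wants a repeatable partition could derive one from a stable hash of each conversation's identifier. The function name `assign_split`, the 90/5/5 ratios, and the `conv-…` example IDs are illustrative assumptions, not something taken from the paper or the dataset card.

```python
import hashlib


def assign_split(conversation_id: str, train: float = 0.90, val: float = 0.05) -> str:
    """Deterministically map an ID to 'train', 'val', or 'test'.

    Hypothetical helper: the ratios are assumptions, not the paper's.
    """
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


# Illustrative IDs only; real records would supply their own identifiers.
ids = [f"conv-{i}" for i in range(1000)]
splits = {cid: assign_split(cid) for cid in ids}
```

Because the assignment depends only on the ID, the same record lands in the same split on every run and on every machine, with no random seed to track.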