WildChat: 1M ChatGPT Interaction Logs in the Wild
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare WILDCHAT with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. ... Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WILDCHAT is released at https://wildchat.allen.ai under AI2 ImpACT Licenses. ... We used LLM Judge to evaluate WILDLLAMA on MT-bench (Zheng et al., 2023), which evaluates chatbot responses across various dimensions such as writing, roleplay, coding, mathematics, reasoning, STEM, and humanities, using GPT-4 for grading. |
| Researcher Affiliation | Collaboration | Wenting Zhao (1), Xiang Ren (2,3), Jack Hessel (2), Claire Cardie (1), Yejin Choi (2,4), Yuntian Deng (2); (1) Cornell University, (2) Allen Institute for Artificial Intelligence, (3) University of Southern California, (4) University of Washington |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'WILDCHAT is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.' This link points to the dataset, not the source code for the methodology described (e.g., data collection scripts, analysis scripts, or fine-tuning code for WILDLLAMA). |
| Open Datasets | Yes | To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WILDCHAT, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. ... WILDCHAT is released at https://wildchat.allen.ai under AI2 ImpACT Licenses. |
| Dataset Splits | Yes | The heatmap shows the average NLLs of finetuning Llama-2 7B on one dataset and evaluating NLLs on the other datasets, using 70% data for training and 30% for validation. We only used the user prompts in the first turn of each conversation. (A sketch of this split appears below the table.) |
| Hardware Specification | Yes | We used four NVIDIA A100 GPUs with 80G memory, an effective batch size of 128 conversations, a learning rate of 2e-5, and a maximum sequence length of 2048 tokens. Any conversations exceeding this length were divided into multiple conversations. We fine-tuned WILDLLAMA for three epochs. |
| Software Dependencies | No | The paper mentions software like 'Llama-2 tokenizer', 'Microsoft's Presidio', 'SpaCy', 'lingua-py', 'LLM Judge', and 'FastChat' but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | For the training of WILDLLAMA, we used WILDCHAT collected up until July 16, 2023. To ensure a direct comparison with the state-of-the-art in open-sourced chatbot models, we adopted the same implementation and hyperparameters as those used for the Vicuna model. We used four NVIDIA A100 GPUs with 80G memory, an effective batch size of 128 conversations, a learning rate of 2e-5, and a maximum sequence length of 2048 tokens. Any conversations exceeding this length were divided into multiple conversations. We fine-tuned WILDLLAMA for three epochs. (A hedged configuration sketch appears below the table.) |
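To make the Dataset Splits row above concrete, here is a minimal sketch of the described preprocessing: keep only the first-turn user prompt of each conversation, then divide the prompts into 70% training and 30% validation. The conversation schema, field names, and random seed below are assumptions for illustration, not the authors' released schema or code.

```python
import random

def first_turn_prompts(conversations):
    """Keep only the first user utterance of each conversation."""
    prompts = []
    for conv in conversations:
        for turn in conv["turns"]:  # assumed schema: {"turns": [{"role", "content"}, ...]}
            if turn["role"] == "user":
                prompts.append(turn["content"])
                break
    return prompts

def split_70_30(prompts, seed=0):
    """Shuffle and split prompts into 70% training / 30% validation."""
    rng = random.Random(seed)
    shuffled = list(prompts)
    rng.shuffle(shuffled)
    cut = int(0.7 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Toy usage:
conversations = [
    {"turns": [{"role": "user", "content": "Write a haiku about rain."},
               {"role": "assistant", "content": "..."}]},
    {"turns": [{"role": "user", "content": "Explain beam search."},
               {"role": "assistant", "content": "..."}]},
]
train_prompts, val_prompts = split_70_30(first_turn_prompts(conversations))
```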
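Similarly, for the Hardware Specification and Experiment Setup rows, the reported hyperparameters (four A100 80GB GPUs, an effective batch size of 128 conversations, learning rate 2e-5, maximum sequence length 2048, three epochs) can be expressed as a Hugging Face TrainingArguments sketch. The per-device batch size and gradient-accumulation split, the output path, and the bf16 setting are assumptions, since the paper reports only the aggregate values and otherwise defers to the Vicuna (FastChat) implementation.

```python
from transformers import TrainingArguments

NUM_GPUS = 4                       # four NVIDIA A100 80GB GPUs (as reported)
PER_DEVICE_BATCH = 4               # assumption
GRAD_ACCUM = 128 // (NUM_GPUS * PER_DEVICE_BATCH)   # reaches the reported effective batch of 128
MAX_SEQ_LEN = 2048                 # as reported; longer conversations are split into multiple examples

training_args = TrainingArguments(
    output_dir="wildllama-7b",                      # hypothetical output path
    num_train_epochs=3,                             # as reported
    per_device_train_batch_size=PER_DEVICE_BATCH,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=2e-5,                             # as reported
    bf16=True,                                      # assumption; typical for A100 training
)
```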