GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems

Authors: Youngsoo Jang, Jongmin Lee, Kee-Eung Kim

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show the experimental results of GPT-Critic on both automatic evaluation and human evaluation. First, we evaluate the performance of GPT-Critic on MultiWOZ 2.0 (Budzianowski et al., 2018) as dataset-based automatic evaluation, compared with baseline methods including offline RL algorithms.
Researcher Affiliation | Academia | Youngsoo Jang¹, Jongmin Lee¹, Kee-Eung Kim¹,²; ¹School of Computing, KAIST, Daejeon, Republic of Korea; ²Graduate School of AI, KAIST, Daejeon, Republic of Korea; {ysjang,jmlee}@ai.kaist.ac.kr, kekim@kaist.ac.kr
Pseudocode | Yes | Algorithm 1 GPT-Critic. Input: training dataset D_0 = {{(g^j, h^j_t, a^j_t, r^j_t, h^j_{t+1})}_{t=0}^{T}}_{j=1}^{N}, policy network (GPT) π_θ, critic network Q_φ (see the loop sketch after the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We evaluate our algorithm on the MultiWOZ 2.0 dataset, which is one of the representative task-oriented dialogue benchmarks. MultiWOZ 2.0 is a large-scale multi-domain Wizard-of-Oz dataset, where a tourist (i.e. user) converses with a clerk (i.e. system) at the information center in a touristic city. (Budzianowski et al., 2018)
Dataset Splits | Yes | It consists of 8438/1000/1000 dialogues for training/validation/testing (a split-check sketch appears after the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers library, the codebase of UBAR (Yang et al., 2021), and DistilGPT2 (Sanh et al., 2019), but does not provide specific version numbers for these software components or libraries (an environment-recording sketch appears after the table).
Experiment Setup | Yes | For the hyperparameters of fine-tuning the GPT-2 model, we follow the setting in the public code of UBAR (Yang et al., 2021). We use N = 5 for the number of candidate actions {a^k}_{k=1}^{N}, and the set of candidate actions is constructed by vanilla softmax sampling from the policy, rather than beam search, to collect diverse actions. For each behavior cloning iteration, all models are fine-tuned with a training dataset from the pre-trained GPT-2 and early-stopped according to the loss on the validation set (a candidate-sampling sketch appears after the table).
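
The Pseudocode row above only quotes the inputs of Algorithm 1. Below is a minimal Python sketch of how the GPT-Critic loop could be organized, assuming the iteration structure implied by the Experiment Setup row (critic fitting, candidate-action relabelling with N sampled actions, then behavior cloning); the helpers train_critic, relabel_actions, and fine_tune_policy are hypothetical placeholders, not the authors' code.

```python
# Hedged sketch of the GPT-Critic outer loop (assumed structure, not the authors' code).
from typing import Callable, Dict, List


def gpt_critic_loop(
    dataset: List[Dict],            # tuples (g, h_t, a_t, r_t, h_{t+1}) per dialogue
    policy,                         # GPT-2-style policy network pi_theta
    critic,                         # critic network Q_phi
    train_critic: Callable,         # hypothetical: fit Q_phi on the current dataset
    relabel_actions: Callable,      # hypothetical: replace a_t with the critic's best of N sampled candidates
    fine_tune_policy: Callable,     # hypothetical: behavior cloning, early-stopped on validation loss
    num_iterations: int = 3,
    num_candidates: int = 5,        # N = 5 in the paper's experiment setup
):
    for _ in range(num_iterations):
        # 1) Fit the critic on the current (relabelled) dataset.
        train_critic(critic, dataset)
        # 2) For each context, sample num_candidates actions from the policy by plain
        #    softmax sampling and keep the one the critic scores highest.
        dataset = relabel_actions(policy, critic, dataset, num_candidates)
        # 3) Behavior-clone the GPT policy on the relabelled dialogues; per the
        #    Experiment Setup row, each iteration fine-tunes from the pre-trained GPT-2.
        policy = fine_tune_policy(dataset)
    return policy
```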
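For the Dataset Splits row, here is a minimal sketch of verifying the 8438/1000/1000 split on a local MultiWOZ 2.0 download; the file names (data.json, valListFile.json, testListFile.json) are assumptions and vary across MultiWOZ releases.

```python
# Hedged sketch: count train/val/test dialogues in a local MultiWOZ 2.0 copy.
import json

with open("data.json") as f:          # full MultiWOZ 2.0 corpus (assumed file name)
    dialogues = json.load(f)          # dict: dialogue_id -> annotation

val_ids = set(open("valListFile.json").read().split())    # assumed list-file name
test_ids = set(open("testListFile.json").read().split())  # assumed list-file name
train_ids = set(dialogues) - val_ids - test_ids

print(len(train_ids), len(val_ids), len(test_ids))        # expected: 8438 1000 1000
```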
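Because the Software Dependencies row notes that no version numbers are given, one practical step in a reproduction is to record the versions you actually install. A minimal sketch, assuming a standard PyTorch plus Hugging Face Transformers environment and the Hub id "distilgpt2":

```python
# Hedged sketch: log the library versions used in your own reproduction run.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)

# DistilGPT2 is hosted on the Hugging Face Hub under the id "distilgpt2".
tokenizer = transformers.AutoTokenizer.from_pretrained("distilgpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("distilgpt2")
```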
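The Experiment Setup row states that the N = 5 candidate actions are drawn by vanilla softmax sampling rather than beam search. A hedged sketch with Hugging Face generate follows; the context string and the plain DistilGPT2 checkpoint are illustrative stand-ins, not the authors' UBAR-style serialization or their fine-tuned policy.

```python
# Hedged sketch: sample N = 5 candidate continuations by plain softmax sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
policy = AutoModelForCausalLM.from_pretrained("distilgpt2")

context = "user: i need a cheap hotel in the north ."   # illustrative dialogue context
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    candidates = policy.generate(
        **inputs,
        do_sample=True,              # plain softmax sampling, not beam search
        top_k=0,                     # disable top-k truncation
        top_p=1.0,                   # disable nucleus truncation
        num_return_sequences=5,      # N = 5 candidate actions
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs["input_ids"].shape[1]
for seq in candidates:
    print(tokenizer.decode(seq[prompt_len:], skip_special_tokens=True))
```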