GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems

Authors: Youngsoo Jang, Jongmin Lee, Kee-Eung Kim

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show the experimental results of GPT-Critic on both automatic evaluation and human evaluation. First, we evaluate the performance of GPT-Critic on MultiWOZ 2.0 (Budzianowski et al., 2018) as dataset-based automatic evaluation, compared with baseline methods including offline RL algorithms.
Researcher Affiliation | Academia | Youngsoo Jang¹, Jongmin Lee¹, Kee-Eung Kim¹,²; ¹School of Computing, KAIST, Daejeon, Republic of Korea; ²Graduate School of AI, KAIST, Daejeon, Republic of Korea; {ysjang,jmlee}@ai.kaist.ac.kr, kekim@kaist.ac.kr
Pseudocode | Yes | Algorithm 1 GPT-Critic. Input: training dataset D_0 = {{(g^j, h^j_t, a^j_t, r^j_t, h^j_{t+1})}_{t=0}^{T}}_{j=1}^{N}, policy network (GPT) π_θ, critic network Q_φ (see the loop sketch after the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository.
Open Datasets | Yes | We evaluate our algorithm on the MultiWOZ 2.0 dataset, which is one of the representative task-oriented dialogue benchmarks. MultiWOZ 2.0 is a large-scale multi-domain Wizard-of-Oz dataset, where a tourist (i.e. user) converses with a clerk (i.e. system) at the information center in a touristic city. (Budzianowski et al., 2018)
Dataset Splits | Yes | It consists of 8438/1000/1000 dialogues for training/validation/testing (a split-check sketch appears after the table).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using the Hugging Face Transformers library, the codebase of UBAR (Yang et al., 2021), and DistilGPT2 (Sanh et al., 2019), but does not provide specific version numbers for these software components or libraries (an environment-recording sketch appears after the table).
Experiment Setup | Yes | For the hyperparameters of fine-tuning the GPT-2 model, we follow the setting in the public code of UBAR (Yang et al., 2021). We use N = 5 for the number of candidate actions {a^k}_{k=1}^{N}, and the set of candidate actions is constructed by vanilla softmax sampling from the policy, rather than beam search, to collect diverse actions. For each behavior cloning iteration, all models are fine-tuned with a training dataset from the pre-trained GPT-2 and early-stopped according to the loss on the validation set (a candidate-sampling sketch appears after the table).
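
The Pseudocode row above only quotes the inputs of Algorithm 1. Below is a minimal Python sketch of how the GPT-Critic loop could be organized, assuming the iteration structure implied by the Experiment Setup row (critic fitting, candidate-action relabelling with N sampled actions, then behavior cloning); the helpers train_critic, relabel_actions, and fine_tune_policy are hypothetical placeholders, not the authors' code.

```python
# Hedged sketch of the GPT-Critic outer loop (assumed structure, not the authors' code).
from typing import Callable, Dict, List


def gpt_critic_loop(
    dataset: List[Dict],            # tuples (g, h_t, a_t, r_t, h_{t+1}) per dialogue
    policy,                         # GPT-2-style policy network pi_theta
    critic,                         # critic network Q_phi
    train_critic: Callable,         # hypothetical: fit Q_phi on the current dataset
    relabel_actions: Callable,      # hypothetical: replace a_t with the critic's best of N sampled candidates
    fine_tune_policy: Callable,     # hypothetical: behavior cloning, early-stopped on validation loss
    num_iterations: int = 3,
    num_candidates: int = 5,        # N = 5 in the paper's experiment setup
):
    for _ in range(num_iterations):
        # 1) Fit the critic on the current (relabelled) dataset.
        train_critic(critic, dataset)
        # 2) For each context, sample num_candidates actions from the policy by plain
        #    softmax sampling and keep the one the critic scores highest.
        dataset = relabel_actions(policy, critic, dataset, num_candidates)
        # 3) Behavior-clone the GPT policy on the relabelled dialogues; per the
        #    Experiment Setup row, each iteration fine-tunes from the pre-trained GPT-2.
        policy = fine_tune_policy(dataset)
    return policy
```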
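For the Dataset Splits row, here is a minimal sketch of verifying the 8438/1000/1000 split on a local MultiWOZ 2.0 download; the file names (data.json, valListFile.json, testListFile.json) are assumptions and vary across MultiWOZ releases.

```python
# Hedged sketch: count train/val/test dialogues in a local MultiWOZ 2.0 copy.
import json

with open("data.json") as f:          # full MultiWOZ 2.0 corpus (assumed file name)
    dialogues = json.load(f)          # dict: dialogue_id -> annotation

val_ids = set(open("valListFile.json").read().split())    # assumed list-file name
test_ids = set(open("testListFile.json").read().split())  # assumed list-file name
train_ids = set(dialogues) - val_ids - test_ids

print(len(train_ids), len(val_ids), len(test_ids))        # expected: 8438 1000 1000
```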
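Because the Software Dependencies row notes that no version numbers are given, one practical step in a reproduction is to record the versions you actually install. A minimal sketch, assuming a standard PyTorch plus Hugging Face Transformers environment and the Hub id "distilgpt2":

```python
# Hedged sketch: log the library versions used in your own reproduction run.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)

# DistilGPT2 is hosted on the Hugging Face Hub under the id "distilgpt2".
tokenizer = transformers.AutoTokenizer.from_pretrained("distilgpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("distilgpt2")
```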
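The Experiment Setup row states that the N = 5 candidate actions are drawn by vanilla softmax sampling rather than beam search. A hedged sketch with Hugging Face generate follows; the context string and the plain DistilGPT2 checkpoint are illustrative stand-ins, not the authors' UBAR-style serialization or their fine-tuned policy.

```python
# Hedged sketch: sample N = 5 candidate continuations by plain softmax sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
policy = AutoModelForCausalLM.from_pretrained("distilgpt2")

context = "user: i need a cheap hotel in the north ."   # illustrative dialogue context
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    candidates = policy.generate(
        **inputs,
        do_sample=True,              # plain softmax sampling, not beam search
        top_k=0,                     # disable top-k truncation
        top_p=1.0,                   # disable nucleus truncation
        num_return_sequences=5,      # N = 5 candidate actions
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,
    )

prompt_len = inputs["input_ids"].shape[1]
for seq in candidates:
    print(tokenizer.decode(seq[prompt_len:], skip_special_tokens=True))
```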