GPT-Critic: Offline Reinforcement Learning for End-to-End Task-Oriented Dialogue Systems
Authors: Youngsoo Jang, Jongmin Lee, Kee-Eung Kim
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we show the experimental results of GPT-Critic on both automatic evaluation and human evaluation. First, we evaluate the performance of GPT-Critic on MultiWOZ 2.0 (Budzianowski et al., 2018) as dataset-based automatic evaluation, compared with baseline methods including offline RL algorithms. |
| Researcher Affiliation | Academia | Youngsoo Jang¹, Jongmin Lee¹, Kee-Eung Kim¹,² — ¹School of Computing, KAIST, Daejeon, Republic of Korea; ²Graduate School of AI, KAIST, Daejeon, Republic of Korea. {ysjang,jmlee}@ai.kaist.ac.kr, kekim@kaist.ac.kr |
| Pseudocode | Yes | Algorithm 1 GPT-Critic. Input: training dataset $\mathcal{D}_0 = \{\{(g^j, h^j_t, a^j_t, r^j_t, h^j_{t+1})\}_{t=0}^{T}\}_{j=1}^{N}$, policy network (GPT) $\pi_\theta$, critic network $Q_\phi$ (a minimal sketch of this loop appears after the table). |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We evaluate our algorithm on the MultiWOZ 2.0 dataset, which is one of the representative task-oriented dialogue benchmarks. The MultiWOZ 2.0 is a large-scale multi-domain Wizard-of-Oz dataset, where a tourist (i.e. user) converses with a clerk (i.e. system) at the information center in a touristic city. (Budzianowski et al., 2018) |
| Dataset Splits | Yes | It consists of 8438/1000/1000 dialogues for training/validation/testing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the 'Hugging Face Transformers library', the 'codebase of UBAR (Yang et al., 2021)', and 'DistilGPT2 (Sanh et al., 2019)', but does not provide specific version numbers for these software components or libraries. |
| Experiment Setup | Yes | For the hyperparameters of fine-tuning the GPT-2 model, we follow the setting in the public code of UBAR (Yang et al., 2021). We use N = 5 for the number of candidate actions $\{a_k\}_{k=1}^{N}$, and the set of candidate actions is constructed by vanilla softmax sampling from the policy, rather than beam search, to collect diverse actions. For each behavior cloning iteration, all models are fine-tuned on the training dataset from the pre-trained GPT-2 and early-stopped according to the loss on the validation set. (A sketch of this sampling step follows the table.) |
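
Since the source code is not released, the following is a minimal sketch of one GPT-Critic iteration as Algorithm 1 describes it: policy evaluation, critic-guided revision of the dataset's actions, then behavior cloning on the revised data. The helpers `train_critic`, `sample_candidates`, and `fine_tune_policy` are hypothetical stand-ins, not functions from the paper.

```python
# A minimal sketch of one GPT-Critic iteration (Algorithm 1), assuming
# hypothetical helpers train_critic, sample_candidates, and
# fine_tune_policy; the authors' implementation is not publicly released.

def gpt_critic_iteration(dataset, policy, critic, n_candidates=5):
    # 1) Policy evaluation: fit the critic Q_phi to the rewards recorded
    #    in the current dataset D_i.
    train_critic(critic, dataset)

    # 2) Action revision: for every turn, sample N candidate actions from
    #    the policy pi_theta and keep the one the critic scores highest,
    #    producing the revised dataset D_{i+1}.
    revised = []
    for goal, history, action, reward, next_history in dataset:
        candidates = sample_candidates(policy, history, n=n_candidates)
        best = max(candidates, key=lambda a: critic(history, a))
        revised.append((goal, history, best, reward, next_history))

    # 3) Policy improvement: behavior-clone the policy (GPT-2) on the
    #    critic-revised dataset, early-stopping on validation loss.
    fine_tune_policy(policy, revised)
    return revised
```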
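
The candidate-action construction from the experiment setup (N = 5, vanilla softmax sampling rather than beam search) can be approximated with the Hugging Face Transformers API the paper mentions. This sketch assumes a fine-tuned DistilGPT2 checkpoint serves as the policy; the prompt format and generation length are illustrative assumptions, not values taken from the paper.

```python
# Sketch of sampling N = 5 diverse candidate actions by vanilla softmax
# sampling (do_sample=True), not beam search. The checkpoint name follows
# the paper's mention of DistilGPT2; generation arguments are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

def sample_candidate_actions(context, n=5, max_new_tokens=40):
    """Draw n candidate system actions from the policy via softmax sampling."""
    inputs = tokenizer(context, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # vanilla softmax sampling, no beam search
        num_return_sequences=n,    # N = 5 candidate actions in the paper
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(o[prompt_len:], skip_special_tokens=True)
            for o in outputs]
```

In a UBAR-style pipeline the `context` string would encode the dialogue history, belief state, and database results as a flat token sequence; here it is treated as an opaque prompt.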