CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers

Authors: Shiyang Li, Semih Yavuz, Kazuma Hashimoto, Jia Li, Tong Niu, Nazneen Rajani, Xifeng Yan, Yingbo Zhou, Caiming Xiong

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluating state-of-the-art DST models on the MultiWOZ dataset with CoCo-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy.
Researcher Affiliation | Collaboration | Salesforce Research; University of California, Santa Barbara
Pseudocode | No | The paper provides diagrams (e.g., Figure 1, Figure 2) to illustrate processes but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/salesforce/coco-dst
Open Datasets | Yes | "We train each of these three models following their publicly released implementations on the standard train/dev/test split of MultiWOZ 2.1 (Eric et al., 2019)."
Dataset Splits | Yes | "We train each of these three models following their publicly released implementations on the standard train/dev/test split of MultiWOZ 2.1 (Eric et al., 2019)."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions using T5-small, BERT-base-uncased, the Adam optimizer, and PyTorch/Fairseq for NMT models, but does not provide version numbers for these software components.
Experiment Setup | Yes | "During training, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5e-5 and a linear warmup of 200 steps. The batch size is set to 36 and the number of training epochs is set to 10. The maximum sequence length of both encoder and decoder is set to 100."
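As a sanity check of the schedule quoted in the Experiment Setup row, the warmup can be sketched in plain Python. Note that `lr_at_step` is a hypothetical helper, not code from the paper's repository; the paper only specifies a 200-step linear warmup to 5e-5, so holding the rate constant after warmup is an assumption.

```python
def lr_at_step(step, base_lr=5e-5, warmup_steps=200):
    """Learning rate with linear warmup to base_lr over warmup_steps.

    Post-warmup behavior (constant rate) is an assumption; the paper
    states only "initial learning rate 5e-5" and "linear warmup 200 steps".
    """
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Halfway through warmup the rate is half the target:
print(lr_at_step(99))   # 2.5e-05
# From step 200 onward the rate stays at the configured 5e-5:
print(lr_at_step(500))  # 5e-05
```

The remaining reported settings (batch size 36, 10 epochs, maximum sequence length 100 for both encoder and decoder) are data-pipeline constants rather than schedule parameters, and would be passed to the tokenizer and data loader in the released implementation.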