Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers
Authors: Shiyang Li, Semih Yavuz, Kazuma Hashimoto, Jia Li, Tong Niu, Nazneen Rajani, Xifeng Yan, Yingbo Zhou, Caiming Xiong
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluating state-of-the-art DST models on the MultiWOZ dataset with CoCo-generated counterfactuals results in a significant performance drop of up to 30.8% (from 49.4% to 18.6%) in absolute joint goal accuracy. |
| Researcher Affiliation | Collaboration | Salesforce Research; University of California, Santa Barbara |
| Pseudocode | No | The paper provides diagrams (e.g., Figure 1, Figure 2) to illustrate processes but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/salesforce/coco-dst |
| Open Datasets | Yes | We train each of these three models following their publicly released implementations on the standard train/dev/test split of MultiWOZ 2.1 (Eric et al., 2019). |
| Dataset Splits | Yes | We train each of these three models following their publicly released implementations on the standard train/dev/test split of MultiWOZ 2.1 (Eric et al., 2019). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions using T5-small, BERT-base-uncased, and Adam optimizer, as well as PyTorch/Fairseq for NMT models, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | During training, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 5e-5 and set linear warmup to 200 steps. The batch size is set to 36 and the number of training epochs to 10. The maximum sequence length of both encoder and decoder is set to 100. |
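The reported setup (Adam, initial learning rate 5e-5, 200-step linear warmup, batch size 36, 10 epochs, max sequence length 100) can be sketched as a small schedule function. This is a minimal illustration, not the authors' released code; in particular, the assumption that the learning rate ramps linearly to its peak and then stays constant is ours, since the paper does not specify any post-warmup decay.

```python
# Hyperparameters quoted in the paper's experiment setup.
BASE_LR = 5e-5        # initial (peak) learning rate for Adam
WARMUP_STEPS = 200    # linear warmup duration
BATCH_SIZE = 36
NUM_EPOCHS = 10
MAX_SEQ_LEN = 100     # applies to both encoder and decoder inputs

def learning_rate(step: int) -> float:
    """Linearly warm up to BASE_LR over WARMUP_STEPS, then hold constant.

    Constant-after-warmup is an assumption; the paper only states
    that a 200-step linear warmup is used.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    return BASE_LR
```

For example, the schedule reaches its peak of 5e-5 exactly at step 199 (the 200th step) and stays there for the remainder of training.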