Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

SGD-X: A Benchmark for Robust Generalization in Schema-Guided Dialogue Systems

Authors: Harrison Lee, Raghav Gupta, Abhinav Rastogi, Yuan Cao, Bin Zhang, Yonghui Wu

Pages: 10938-10946

AAAI 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We explore the robustness of dialogue systems to linguistic variations in schemas by designing SGD-X, a benchmark extending SGD with semantically similar yet stylistically diverse variants for every schema. We observe that two top state tracking models fail to generalize well across schema variants, measured by joint goal accuracy and a novel metric for measuring schema sensitivity. Additionally, we present a simple model-agnostic data augmentation method to improve schema robustness.
Researcher Affiliation	Industry	Google Research EMAIL
Pseudocode	No	The paper does not contain pseudocode or algorithm blocks. It references figures from external papers that might contain such elements but does not provide its own.
Open Source Code	Yes	We release SGD-X and an evaluation script for schema-guided dialogue state tracking models on GitHub at https://github.com/google-research-datasets/dstc8-schema-guided-dialogue
Open Datasets	Yes	The Schema-Guided Dialogue (SGD) dataset introduced a paradigm for enabling models to support any service in zero-shot through schemas, which describe service APIs to models in natural language.
Dataset Splits	No	The paper states that models are trained on the original SGD training set and evaluated on SGD-X, but it does not explicitly provide specific train/validation/test split percentages or sample counts within the paper.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies	No	The paper mentions using "spaCy (Honnibal et al. 2020)" and models like "BERT-Base" and "T5-Base" but does not specify version numbers for any software libraries or dependencies used in their implementation.
Experiment Setup	No	The paper states "More training details in the Appendix, available in the arXiv version of this paper." but these details are not provided in the main text of the paper itself.