On Compositional Uncertainty Quantification for Seq2seq Graph Parsing

Authors: Zi Lin, Du Phan, Panupong Pasupat, Jeremiah Zhe Liu, Jingbo Shang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through a thorough evaluation of compositional uncertainty on three different tasks across ten domains, we demonstrate that CECE reflects distributional shift better than vanilla sequence ECE. Finally, we validate the effectiveness of compositional uncertainty on the task of collaborative semantic parsing, where the model is allowed to send a limited number of subgraphs for human review. The results show that collaborative performance based on uncertain-subgraph selection consistently outperforms random subgraph selection (30% average error reduction rate) and performs comparably to oracle subgraph selection (only 0.33 difference in average prediction error), indicating that compositional uncertainty is an ideal signal for model errors and can benefit various downstream tasks.
Researcher Affiliation | Collaboration | Zi Lin (1,2), Du Phan (2), Panupong Pasupat (2), Jeremiah Liu (2,3), Jingbo Shang (1); (1) UC San Diego, (2) Google Research, (3) Harvard University
Pseudocode | Yes | Algorithm 1: Graph Autoregressive Process (GAP)
Open Source Code | Yes | Open-source code may be found at https://github.com/google/uncertainty-baselines.
Open Datasets | Yes | Redwoods: The LinGO Redwoods Treebank... (Flickinger et al., 2014; Bender et al., 2015)... SMCalFlow (Andreas et al., 2020)... SNIPS (Coucke et al., 2018)
Dataset Splits | Yes | For the in-domain test, we train and evaluate models on the subset of the treebank corresponding to the 25 Wall Street Journal (WSJ) sections with standard data splits (Flickinger et al., 2012). For out-of-domain (OOD) evaluation, we select 7 diverse datasets from Redwoods... We use the standard data split from the original paper and evaluate inference results on the development set. We train models on five source domains, use a sixth for development, and test on the remaining domain.
Hardware Specification | No | The paper mentions using 'T5-large (770 million parameters)' and 'T5X' but does not specify hardware details such as GPU models, CPU types, or memory used for the experiments.
Software Dependencies | No | The paper mentions using 'T5X' and 'JAX and Flax' but does not provide version numbers for these or any other software dependencies.
Experiment Setup | No | The paper mentions fine-tuning 'T5-large' and evaluating several uncertainty baselines (e.g., 'Monte Carlo Dropout' with '5 dropout samples' and 'Deep Ensemble' with '4 deterministic models'), but does not report specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or optimizer settings.
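The table's first row contrasts CECE with vanilla sequence ECE. For reference, below is a minimal sketch of the standard binned expected calibration error; the paper's exact CECE definition is not reproduced here, and the confidences and correctness labels are illustrative placeholders. A compositional variant would apply the same computation to per-subgraph rather than whole-sequence confidences.

```python
# Hedged sketch: standard binned Expected Calibration Error (ECE).
# The paper's CECE may differ in detail; inputs here are toy values.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weight-averaged |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.mean()  # fraction of examples in this bin
            ece += weight * abs(correct[in_bin].mean()
                                - confidences[in_bin].mean())
    return ece

# Sequence-level ECE: one confidence per predicted graph,
# one exact-match outcome per prediction (toy numbers).
seq_conf = [0.9, 0.8, 0.95, 0.6]
seq_correct = [1, 1, 1, 0]
print(expected_calibration_error(seq_conf, seq_correct))  # ≈ 0.2375
```

A perfectly calibrated predictor (confidence equal to empirical accuracy in every bin) yields an ECE of 0, which is why a lower CECE under distributional shift is read as better-calibrated compositional uncertainty.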
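The pseudocode row names Algorithm 1, a Graph Autoregressive Process (GAP), which the table does not reproduce. The sketch below only illustrates the generic autoregressive chain-rule factorization such a process relies on, reading a subgraph's confidence off the log-probabilities of the tokens that realize it in the linearized graph. The function names and the span representation are assumptions for illustration, not the paper's algorithm.

```python
# Hedged sketch: chain-rule probabilities for a linearized graph.
# Not the paper's Algorithm 1; span indexing is a hypothetical choice.
import math

def sequence_log_prob(token_log_probs):
    """log p(y | x) = sum_t log p(y_t | y_<t, x) for an AR decoder."""
    return sum(token_log_probs)

def subgraph_confidence(token_log_probs, span):
    """Confidence of one subgraph: product of the probabilities of the
    linearized tokens realizing it (span is a [start, end) pair)."""
    start, end = span
    return math.exp(sum(token_log_probs[start:end]))

# Toy example: four decoded tokens, each with probability 0.5.
lps = [math.log(0.5)] * 4
print(math.exp(sequence_log_prob(lps)))  # ≈ 0.0625 (= 0.5 ** 4)
print(subgraph_confidence(lps, (0, 2)))  # ≈ 0.25   (= 0.5 ** 2)
```

Under this factorization, subgraph confidences are what the collaborative-parsing experiment ranks when deciding which subgraphs to send for human review.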
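The experiment-setup row mentions Monte Carlo Dropout (5 dropout samples) and Deep Ensemble (4 deterministic models) as uncertainty baselines. Here is a minimal sketch of how such baselines aggregate predictive distributions, assuming hypothetical `forward` callables that return per-token probabilities; the paper's T5X/Flax implementation is not reproduced.

```python
# Hedged sketch of two standard uncertainty baselines: MC Dropout
# (average over stochastic forward passes) and Deep Ensemble (average
# over independently trained models). `forward` callables are stand-ins.
import numpy as np

def mc_dropout_confidence(forward_with_dropout, x, n_samples=5):
    """Mean per-token probability over n stochastic forward passes."""
    probs = np.stack([forward_with_dropout(x) for _ in range(n_samples)])
    return probs.mean(axis=0)

def ensemble_confidence(forwards, x):
    """Mean per-token probability over independently trained models."""
    probs = np.stack([f(x) for f in forwards])
    return probs.mean(axis=0)

# Toy usage with fake predictive distributions over 3 tokens:
rng = np.random.default_rng(0)
fake_forward = lambda x: rng.uniform(0.6, 0.9, size=3)
p = mc_dropout_confidence(fake_forward, x=None, n_samples=5)
print(p.shape)  # one averaged probability per token position
```

Averaging the predictive distribution over samples (rather than, say, taking the max) is what makes both baselines usable as calibrated confidence estimates for the ECE-style evaluation above.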