On Compositional Uncertainty Quantification for Seq2seq Graph Parsing
Authors: Zi Lin, Du Phan, Panupong Pasupat, Jeremiah Zhe Liu, Jingbo Shang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a thorough evaluation of compositional uncertainty on three different tasks across ten domains, we demonstrate that CECE is a better reflection of distributional shift than vanilla sequence ECE. Finally, we validate the effectiveness of compositional uncertainty on the task of collaborative semantic parsing, where the model is allowed to send a limited number of subgraphs for human review. The results show that collaborative performance based on uncertain subgraph selection consistently outperforms random subgraph selection (30% average error reduction rate) and performs comparably to oracle subgraph selection (only 0.33 difference in average prediction error), indicating that compositional uncertainty is an ideal signal for model errors and can benefit various downstream tasks. |
| Researcher Affiliation | Collaboration | Zi Lin^1,2, Du Phan^2, Panupong Pasupat^2, Jeremiah Zhe Liu^2,3, Jingbo Shang^1 — ^1 UC San Diego, ^2 Google Research, ^3 Harvard University |
| Pseudocode | Yes | Algorithm 1 Graph Autoregressive Process (GAP) |
| Open Source Code | Yes | Open-source code may be found at https://github.com/google/uncertainty-baselines. |
| Open Datasets | Yes | Redwoods: The LinGO Redwoods Treebank... (Flickinger et al., 2014; Bender et al., 2015)... SMCalFlow (Andreas et al., 2020)... SNIPS (Coucke et al., 2018) |
| Dataset Splits | Yes | For the in-domain test, we train and evaluate models on the treebank subset corresponding to the 25 Wall Street Journal (WSJ) sections with standard data splits (Flickinger et al., 2012). For out-of-domain (OOD) evaluations, we select 7 diverse datasets from Redwoods... We use the standard data split in the original paper and evaluate inference results on the development set. We train models on five source domains, use a sixth one for development, and test on the remaining domain. |
| Hardware Specification | No | The paper mentions using 'T5-large (770 million parameters)' and 'T5X' but does not specify any hardware details like GPU models, CPU types, or memory used for experiments. |
| Software Dependencies | No | The paper mentions using 'T5X' and 'JAX and Flax' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | No | The paper mentions fine-tuning 'T5-large' and evaluating different uncertainty baselines (e.g., 'Monte Carlo Dropout' with '5 dropout samples', 'Deep Ensemble' with '4 deterministic models'). However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed optimizer settings for the experimental setup. |
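The CECE metric quoted in the Research Type row is compared against vanilla sequence ECE. As a reference point, the following is a minimal sketch of standard expected calibration error; the function name, equal-width binning scheme, and inputs are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE sketch: bin predictions by confidence, compare the
    mean confidence to the empirical accuracy within each bin, and take
    the bin-size-weighted average of the absolute gaps."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue  # skip empty bins
        accuracy = correct[mask].mean()
        avg_confidence = confidences[mask].mean()
        ece += mask.mean() * abs(accuracy - avg_confidence)
    return ece
```

A perfectly calibrated set of predictions yields an ECE of 0; the compositional variant (CECE) evaluated in the paper applies this idea at the subgraph level rather than over whole sequences.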
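The Experiment Setup row mentions Monte Carlo dropout with 5 samples as an uncertainty baseline. A minimal sketch of that technique follows; `logits_fn` and its signature are hypothetical stand-ins for a stochastic forward pass with dropout kept active, not the paper's T5X code:

```python
import numpy as np

def mc_dropout_uncertainty(logits_fn, x, n_samples=5, rng=None):
    """Monte Carlo dropout sketch: run the model n_samples times with
    dropout enabled, average the softmax outputs for the predictive
    distribution, and report the per-class variance as an uncertainty
    signal."""
    rng = np.random.default_rng(rng)
    probs = []
    for _ in range(n_samples):
        z = logits_fn(x, rng)        # one stochastic forward pass
        e = np.exp(z - z.max())      # numerically stable softmax
        probs.append(e / e.sum())
    probs = np.stack(probs)
    return probs.mean(axis=0), probs.var(axis=0)
```

With a deterministic `logits_fn` the variance collapses to zero; dropout noise across samples is what produces a nonzero spread, which the paper's baselines then aggregate into sequence- or subgraph-level uncertainty scores.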