Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Ensembling Graph Predictions for AMR Parsing
Authors: Thanh Lam Hoang, Gabriele Picco, Yufang Hou, Young-Suk Lee, Lam Nguyen, Dzung Phan, Vanessa Lopez, Ramon Fernandez Astudillo
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our approach, we carried out experiments in AMR parsing problems. The experimental results demonstrate that the proposed approach can combine the strength of state-of-the-art AMR parsers to create new predictions that are more accurate than any individual models in five standard benchmark datasets. |
| Researcher Affiliation | Industry | 1 IBM Research, Dublin, Ireland 2 IBM Research, Thomas J. Watson Research Center, Yorktown Heights, USA |
| Pseudocode | Yes | Algorithm 1: Graph ensemble with the Graphene algorithm. |
| Open Source Code | Yes | Source code is open-sourced (footnote 1: https://github.com/IBM/graph_ensemble_learning). |
| Open Datasets | Yes | Similarly to [Bevilacqua et al., 2021], we use five standard benchmark datasets [dat] to evaluate our approach. Table 1 shows the statistics of the datasets. AMR 2.0 and AMR 3.0 are divided into train, development and testing sets and we use them for in-distribution evaluation in Section 4.2. (...) AMR benchmark datasets. https://amr.isi.edu/download.html. |
| Dataset Splits | Yes | Table 1: Benchmark datasets. (...) For AMR 2.0 and 3.0, the models are trained on the training dataset, validated on the development dataset. We report results on testing sets in the in-distribution evaluation. |
| Hardware Specification | Yes | In all experiments, we used a Tesla GPU V100 for model training and used 8 CPUs for making an ensemble. |
| Software Dependencies | No | The paper mentions software components and models like BART, T5, ADAM optimization, and Stanford Core NLP, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | The model is trained with 30 epochs. We use ADAM optimization with a learning rate of 1e-4 and a mini-batch size of 4. |
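
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the training budget easy to work out. The following is a minimal sketch, not the authors' code: the class name `TrainingSetup`, its helper methods, and the example count of 1000 training examples are hypothetical, and the sketch assumes the last partial mini-batch of each epoch is kept. Only the values 30 epochs, ADAM, learning rate 1e-4, and mini-batch size 4 come from the table above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the hyperparameters quoted in the table."""

    epochs: int = 30              # "The model is trained with 30 epochs."
    optimizer: str = "adam"       # "We use ADAM optimization ..."
    learning_rate: float = 1e-4   # "... with a learning rate of 1e-4 ..."
    batch_size: int = 4           # "... and a mini-batch size of 4."

    def steps_per_epoch(self, num_examples: int) -> int:
        # Assumption: the final partial mini-batch is kept, so round up.
        return math.ceil(num_examples / self.batch_size)

    def total_steps(self, num_examples: int) -> int:
        # Total optimizer updates over the whole training run.
        return self.epochs * self.steps_per_epoch(num_examples)


setup = TrainingSetup()

# Placeholder dataset size for illustration only; not taken from the paper.
example_size = 1000
print(setup.steps_per_epoch(example_size), setup.total_steps(example_size))
```

Because software versions are not reported (see the Software Dependencies row), a configuration like this captures only part of what a rerun would need.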