Structured Prediction as Translation between Augmented Natural Languages

Authors: Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, Stefano Soatto

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks including joint entity and relation extraction, nested named entity recognition, relation classification, semantic role labeling, event extraction, coreference resolution, and dialogue state tracking. ... Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction (CoNLL04, ADE, NYT, and ACE2005 datasets), relation classification (FewRel and TACRED), and semantic role labeling (CoNLL-2005 and CoNLL-2012). ... 5 EXPERIMENTS In this section, we show that our TANL framework, with the augmented natural languages outlined in Section 4, can effectively solve the structured prediction tasks considered and exceeds the previous state of the art on multiple datasets. (An illustrative sketch of the augmented-language format follows the table.)
Researcher Affiliation | Industry | Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero Nogueira dos Santos, Bing Xiang, Stefano Soatto, Amazon Web Services, {paoling,benathi,kronej,jieman,aachille,ranubhai,cicnog,bxiang,soattos}@amazon.com
Pseudocode | No | The paper describes steps for decoding structured objects but does not contain a formal 'Pseudocode' or 'Algorithm' block.
Open Source Code | Yes | The code is available at https://github.com/amazon-research/tanl.
Open Datasets | Yes | Datasets. We experiment on the following datasets: CoNLL04 (Roth & Yih, 2004), ADE (Gurulingappa et al., 2012), NYT (Riedel et al., 2010), and ACE2005 (Walker et al., 2006).
Dataset Splits | Yes | The CoNLL04 dataset... we use the training (922 sentences), validation (231 sentences), and test (288 sentences) split by Gupta et al. (2016). ... The NYT dataset ... It consists of 56,195 sentences for training, 5,000 for validation, and 5,000 for testing. ... The English OntoNotes dataset... consists of 59,924 sentences for training, 8,528 for validation, and 8,262 for testing.
Hardware Specification | Yes | We use: 8 V100 GPUs with a batch size of 8 per GPU;
Software Dependencies | No | The paper mentions 'a pre-trained T5-base model (Raffel et al., 2019)' and 'the implementation of Hugging Face's Transformers library (Wolf et al., 2019)' but does not provide specific version numbers for these software components or any other libraries.
Experiment Setup | Yes | To keep our framework as simple as possible, hyperparameters are the same across the majority of our experiments. We use: 8 V100 GPUs with a batch size of 8 per GPU; the AdamW optimizer (Kingma & Ba, 2015; Loshchilov & Hutter, 2019); linear learning rate decay starting from 0.0005; maximum input/output sequence length equal to 256 tokens at training time (longer sequences are truncated), except for relation classification, coreference resolution, and dialogue state tracking (see below). The number of fine-tuning epochs is adjusted depending on the size of the dataset, as described later. (A configuration sketch follows the table.)
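Referenced from the Research Type row above: a minimal sketch, in Python, of how a joint entity and relation extraction example might be rendered as a TANL-style input/output text pair. The bracket markers and the `make_target` helper are illustrative assumptions based on the paper's description of augmented natural languages, not the authors' released code.

```python
# Illustrative sketch of a TANL-style "augmented natural language" target for
# joint entity and relation extraction. The marker syntax is an assumption here.

sentence = "Tolkien's epic novel The Lord of the Rings was published in 1954-1955."

entities = [
    {"span": "Tolkien", "type": "person"},
    {"span": "The Lord of the Rings", "type": "book"},
]
relations = [
    {"head": "Tolkien", "relation": "author", "tail": "The Lord of the Rings"},
]

def make_target(sentence, entities, relations):
    """Wrap each entity span as [ span | type | relation = tail ] in the sentence,
    splicing right to left over character offsets so earlier edits do not shift later ones."""
    spans = []
    for ent in entities:
        start = sentence.find(ent["span"])
        spans.append((start, start + len(ent["span"]), ent))
    target = sentence
    for start, end, ent in sorted(spans, key=lambda s: s[0], reverse=True):
        rels = [f"{r['relation']} = {r['tail']}" for r in relations if r["head"] == ent["span"]]
        fields = [ent["span"], ent["type"]] + rels
        target = target[:start] + "[ " + " | ".join(fields) + " ]" + target[end:]
    return target

print(make_target(sentence, entities, relations))
# [ Tolkien | person | author = The Lord of the Rings ]'s epic novel
# [ The Lord of the Rings | book ] was published in 1954-1955.
```

The model is then trained, as a plain translation task, to map the unannotated sentence to the annotated one, and the structured objects are recovered by parsing the brackets out of the decoded text.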
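Referenced from the Experiment Setup row: a rough sketch of how the stated hyperparameters (T5-base, AdamW, linear decay from 0.0005, 256-token truncation, batch size 8 per GPU) could be wired together with Hugging Face's Transformers library. The epoch count, output directory, and dataset pipeline are placeholders, not values from the released implementation.

```python
# Minimal fine-tuning configuration mirroring the reported hyperparameters.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

args = Seq2SeqTrainingArguments(
    output_dir="tanl-output",        # placeholder path
    per_device_train_batch_size=8,   # batch size 8 per GPU (8 V100s in the paper)
    learning_rate=5e-4,              # linear decay starting from 0.0005
    lr_scheduler_type="linear",
    num_train_epochs=10,             # placeholder: the paper adjusts this per dataset
)

def preprocess(example):
    """Tokenize source/target text, truncating to 256 tokens as in the paper."""
    inputs = tokenizer(example["source"], max_length=256, truncation=True)
    labels = tokenizer(example["target"], max_length=256, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=...)  # dataset omitted
# trainer.train()
```

The Trainer defaults to AdamW with a linear schedule, so only the initial learning rate and batch size need to be set explicitly to match the description above.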