Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Dependency Parsing is More Parameter-Efficient with Normalization

Authors: Paolo Gajo, Domenic Rosati, Hassan Sajjad, Alberto Barrón-Cedeño

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we provide theoretical evidence and empirical results revealing that a lack of normalization necessarily results in overparameterized parser models, where the extra parameters compensate for the sharp softmax outputs produced by high variance inputs to the biaffine scoring function. We argue that biaffine scoring can be made substantially more efficient by performing score normalization. We conduct experiments on semantic and syntactic dependency parsing in multiple languages, along with latent graph inference on non-linguistic data, using various settings of a k-hop parser. We train N-layer stacked BiLSTMs and evaluate the parser's performance with and without normalizing biaffine scores.
Researcher Affiliation	Academia	Paolo Gajo (University of Bologna); Domenic Rosati (Dalhousie University); Hassan Sajjad (Dalhousie University); Alberto Barrón-Cedeño (University of Bologna)
Pseudocode	No	The paper describes the methodology and architecture in detail using textual descriptions and diagrams (e.g., Figure 2: Dependency parsing diagram) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code: https://github.com/paolo-gajo/EfficientSDP
Open Datasets	Yes	To train and evaluate our models, we use four SemDP datasets, along with the 2.2 version of the Universal Dependencies English EWT treebank (enEWT) [24] and SciDTB [48] for SynDP. Table 1 summarizes datasets statistics and reports their entity and relation class annotations. As regards SemDP, ADE [12] is a medical-domain dataset comprising reports of drug adverse-effect reactions. ... CoNLL04 [35] contains news texts ... SciERC [21] is a dataset compiled from sentences extracted from artificial intelligence literature. ... For ADE, CoNLL04, and SciERC we use the splits provided in [4] (footnote 2: https://drive.google.com/drive/folders/1vVKJIUzK4hIipfdEGmS0CCoFmUmZwOQV). As regards SynDP, we use enEWT [24] to compare directly against [13]'s results. ... Instead, we use SciDTB [48], a discourse analysis dataset ... In Appendix A we also carry out SynDP experiments using Universal Dependencies [25] datasets in six other languages. ... Finally, we also conduct experiments on three non-linguistic datasets: PCQM-Contact [10], CIFAR10 Superpixel [9], and QM9 [33].
Dataset Splits	Yes	Table 1: Size of the train/dev/test splits and entity/relation classes for each dataset. ADE [12]: 2,563 / 854 / 300; entities: disease, drug; relations: adverseEffect. CoNLL04 [35]: 922 / 231 / 288; entities: organization, person, location; relations: kill, locatedIn, workFor, orgBasedIn, liveIn. SciERC [21]: 1,366 / 187 / 397; entities: generic, material, method, metric, otherSciTerm, task; relations: usedFor, featureOf, hyponymOf, evaluateFor, partOf, compare, conjunction. ERFGC [46]: 242 / 29 / 29; classes: food, tool, duration, quantity, actionByChef, discontAction, actionByFood, actionByTool, foodState, toolState. enEWT [24]: 10,098 / 1,431 / 1,427; x POS tags UD relations. SciDTB [48]: 2,567 / 814 / 817; x POS tags UD relations.
Hardware Specification	Yes	We ran all of our experiments on a cluster of NVIDIA H100 (96GB of VRAM) and NVIDIA L40 (48GB of VRAM) GPUs, one run per single GPU.
Software Dependencies	No	The paper mentions using specific models like BERTbase [6] and optimizers like AdamW [18], but does not provide explicit version numbers for these software components or underlying libraries such as Python or PyTorch.
Experiment Setup	Yes	We experiment with a range of hyperparameters for the encoder, tagger, and parser, as listed in Table 2. We use BERTbase [6] as our pre-trained encoder, which we keep frozen throughout the whole training run in our main setting. As regards the tagger, we set L_ϕ = 1 and h_ϕ = 100, as in [3], with weights initialized with a Xavier uniform distribution. We train our models for 2k steps and evaluate on the development partition of each dataset every 100 steps. ... We set the learning rate at η = 1 × 10⁻³ when the encoder is kept frozen. In all settings, including ablations, we use AdamW [18] as the optimizer and a batch size of 8.
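
The frozen-encoder settings quoted in the Experiment Setup row can be collected into a small configuration object. The following is only a sketch for readers who want to check the arithmetic: the class name, field names, and helper methods are hypothetical and are not taken from the paper or its repository, and it assumes that development-set evaluation happens at steps 100, 200, ..., 2000 (the quoted text only says "every 100 steps").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenEncoderSetup:
    """Values quoted in the Experiment Setup row; all names here are hypothetical."""

    encoder: str = "BERT-base"           # kept frozen in the main setting
    tagger_layers: int = 1               # L_phi in the quoted text
    tagger_hidden: int = 100             # h_phi in the quoted text
    tagger_init: str = "xavier_uniform"  # Xavier uniform initialization
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3          # used when the encoder is frozen
    batch_size: int = 8
    train_steps: int = 2_000
    eval_every: int = 100

    def eval_steps(self) -> list[int]:
        """Steps at which the development partition is evaluated (assumed schedule)."""
        return list(range(self.eval_every, self.train_steps + 1, self.eval_every))

    def examples_seen(self) -> int:
        """Training examples processed in one run, counting repeats across epochs."""
        return self.train_steps * self.batch_size


setup = FrozenEncoderSetup()
print(len(setup.eval_steps()), setup.eval_steps()[0], setup.eval_steps()[-1])  # 20 100 2000
print(setup.examples_seen())  # 16000
```

Under these assumptions a run performs 20 development evaluations and processes 16,000 training examples. For the smaller datasets in Table 1 this means several passes over the training split.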