Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Measuring and Reducing Model Update Regression in Structured Prediction for NLP
Authors: Deng Cai, Elman Mansimov, Yi-An Lai, Yixuan Su, Lei Shu, Yi Zhang
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We carry out a series of experiments to examine the severity of model update regression under various model update scenarios. Experiments show that BCR can better mitigate model update regression than model ensemble and knowledge distillation approaches. |
| Researcher Affiliation | Collaboration | Deng Cai (The Chinese University of Hong Kong), Elman Mansimov (Amazon AWS AI Labs), Yi-An Lai (Amazon AWS AI Labs), Yixuan Su (University of Cambridge), Lei Shu (Amazon AWS AI Labs), Yi Zhang (Amazon AWS AI Labs) |
| Pseudocode | No | The paper describes its methods and approaches in natural language and mathematical formulas, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We will release our code upon acceptance. |
| Open Datasets | Yes | We use the English EWT treebank from the Universal Dependency (UD 2.2) Treebanks. We use the TOP dataset (Gupta et al., 2018) for our experiments. |
| Dataset Splits | Yes | We adopt the standard training/dev/test splits and use the universal POS tags (Petrov et al., 2012) provided in the treebank. |
| Hardware Specification | Yes | With the same inference hardware (one Nvidia V100 GPU) and the same batch size of 32, the decoding and re-ranking speeds of deepbiaf are 171 and 244 sentences per second, and 64 and 221 sentences per second for stackptr. |
| Software Dependencies | No | The paper mentions using NeuroNLP2, Fairseq, and Hugging Face libraries for implementing models, but it does not provide version numbers for these software dependencies. |
| Experiment Setup | Yes | For BCR, various decoding methods are explored for candidate generation. Specifically, we use k-best spanning trees algorithm... and beam search... We also explore sampling-based decoding methods such as top-k sampling (k ∈ {5, 10, 50, 100}), top-p sampling (p ∈ {0.95, 0.90, 0.85, 0.80}), and dropout-p sampling (p ∈ {0.1, 0.2, 0.3, 0.4}). The number of candidates in BCR is set to 10. |