Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms

Authors: Firas Trabelsi, David Vilar, Mara Finkelstein, Markus Freitag

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on machine translation tasks demonstrate that the proposed method requires 1/16 of the utility metric computations compared to vanilla MBR decoding while achieving equal translation quality, measured by COMET22 on the WMT22 dataset (en-de, en-ru).
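The headline saving is easy to sanity-check by counting metric calls. Vanilla MBR scores every candidate against every pseudo-reference, so with N samples it needs on the order of N² utility evaluations; observing only a fraction of that matrix scales the cost down linearly. The sketch below is a simplification (it ignores constant-factor details of how the budget is spent), not the paper's exact accounting:

```python
def utility_calls(n_samples: int, budget_fraction: float = 1.0) -> int:
    """Approximate number of pairwise utility-metric evaluations.

    Vanilla MBR scores every candidate against every pseudo-reference,
    i.e. roughly n_samples ** 2 calls; a matrix-completion approximation
    that observes only a fraction of the entries scales this linearly
    (a simplification for illustration).
    """
    return int(n_samples ** 2 * budget_fraction)

n = 1024  # samples per source sentence, as used in the paper
vanilla = utility_calls(n)        # full N x N matrix: 1,048,576 calls
pmbr = utility_calls(n, 1 / 16)   # the reported 1/16 budget: 65,536 calls
```

With 1024 samples per sentence, the full matrix costs just over a million metric calls per source sentence, which is why a 1/16 reduction matters at WMT scale.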
Researcher Affiliation | Industry | Firas Trabelsi (Google, firast@google.com); David Vilar (Google, vilar@google.com); Mara Finkelstein (Google, marafin@google.com); Markus Freitag (Google, freitag@google.com)
Pseudocode | Yes | Algorithm 1: ALS for Matrix Completion; Algorithm 2: PMBR, MBR Approximation using ALS
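For readers without access to the supplemental code, the core of Algorithm 1 can be sketched as standard alternating least squares for matrix completion: fix one factor, solve a ridge regression over the observed entries for the other, and alternate. This is a minimal NumPy sketch under the paper's hyperparameter names (λ, rank r, n steps); the actual implementation details (initialization, solver, stopping rule) are assumptions, not the authors' code:

```python
import numpy as np

def als_complete(M, mask, r=10, lam=0.1, n_steps=30, seed=0):
    """Complete a partially observed utility matrix M by ALS.

    M    : (N1, N2) matrix; only entries with mask == 1 are trusted.
    mask : (N1, N2) 0/1 array marking which utilities were computed.
    Fits a rank-r factorization M ~ U @ V.T with L2 weight lam by
    alternating regularized least-squares solves, then returns the
    dense reconstruction.
    """
    N1, N2 = M.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((N1, r)) * 0.1
    V = rng.standard_normal((N2, r)) * 0.1
    reg = lam * np.eye(r)
    for _ in range(n_steps):
        # Fix V: each row of U is a ridge regression on its observed entries.
        for i in range(N1):
            obs = mask[i].astype(bool)
            Vi = V[obs]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ M[i, obs])
        # Fix U: symmetric update for each row of V (a column of M).
        for j in range(N2):
            obs = mask[:, j].astype(bool)
            Uj = U[obs]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ M[obs, j])
    return U @ V.T
```

In the PMBR setting, `M` would be the candidate-by-pseudo-reference utility matrix, `mask` marks the budgeted subset of entries actually scored by the metric, and row means of the completed matrix give the approximate expected utilities for MBR selection.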
Open Source Code | Yes | We share the code of all the approximation methods used, including our new algorithm. We also use an open dataset, WMT22. We are working on clearing internal Google policies to release the generated data and scores. The PMBR algorithm code is submitted as supplemental material.
Open Datasets | Yes | We run experiments using the WMT 2022 test sets for English-German (en-de) and English-Russian (en-ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information.
Dataset Splits | Yes | We run experiments using the WMT 2022 test sets for English-German (en-de) and English-Russian (en-ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information. We minimize this loss per language pair on 10 examples that we hold out from the data generated with the WMT 2022 datasets with this search space...
Hardware Specification | Yes | For reference, 30 steps of the ALS algorithm with r = 10 running on a CPU take on average 0.2 seconds, while MetricX inference takes 3.4 seconds on a TPUv4 platform. Sample generation: we used around 500 TPUv5 chips for 5 hours per language pair to generate the samples. MetricX pairwise computations: we used around 2000 TPUv4 chips for 24 hours per language pair to compute all the scores. Scoring simulations: these were run on CPUs in parallel on a cluster of 1000 machines.
Software Dependencies | No | The paper mentions specific tools and models such as MetricX, COMET22, and PaLM8B, but does not provide version numbers for the software dependencies or libraries used to implement and run the experiments.
Experiment Setup | Yes | The ALS algorithm has three hyperparameters, λ, r, and n, as described in Algorithm 1. We perform a grid search to optimize these hyperparameters, setting our loss function to be the accuracy with respect to the vanilla MBR method... with the search space λ ∈ {0.1, 0.15, 0.2}, r ∈ {5, 6, ..., 15}, n ∈ {10, 11, ..., 30}. We use MetricX (Juraska et al., 2023) as the utility function for all variants of MBR decoding and sample 1024 examples for each sentence using epsilon sampling with ε = 0.02 (Freitag et al., 2023a) and 3-shot prompting with examples taken from the FLORES corpus (Guzmán et al., 2019).
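The epsilon sampling used here (Freitag et al., 2023a) truncates the model's next-token distribution to tokens with probability at least ε before sampling. A minimal per-step sketch follows; the paper applies this inside the translation model's decoder, and the greedy fallback when all tokens are pruned is an assumption of this sketch, not a detail from the paper:

```python
import numpy as np

def epsilon_sample(probs, eps=0.02, rng=None):
    """Epsilon sampling for one decoding step.

    probs : 1-D array of next-token probabilities (sums to 1).
    Tokens with probability below eps are removed, the remainder is
    renormalized, and one token index is drawn from it. If pruning
    removes everything, fall back to the argmax token (an assumed
    tie-breaking choice for this sketch).
    """
    rng = rng or np.random.default_rng()
    p = np.where(probs >= eps, probs, 0.0)
    total = p.sum()
    if total == 0.0:
        return int(np.argmax(probs))
    return int(rng.choice(len(p), p=p / total))
```

With ε = 0.02, as in the paper, low-probability tail tokens can never be sampled, which keeps the 1024 samples per sentence fluent while still diverse enough for MBR.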