Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms

Authors: Firas Trabelsi, David Vilar, Mara Finkelstein, Markus Freitag

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on machine translation tasks demonstrate that the proposed method requires 1/16 of the utility metric computations compared to vanilla MBR decoding while achieving equal translation quality, measured by COMET22 on the WMT22 dataset (en-de, en-ru).
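The headline saving is easy to sanity-check by counting metric calls. Vanilla MBR scores every candidate against every pseudo-reference, so with N samples it needs on the order of N² utility evaluations; observing only a fraction of that matrix scales the cost down linearly. The sketch below is a simplification (it ignores constant-factor details of how the budget is spent), not the paper's exact accounting:

```python
def utility_calls(n_samples: int, budget_fraction: float = 1.0) -> int:
    """Approximate number of pairwise utility-metric evaluations.

    Vanilla MBR scores every candidate against every pseudo-reference,
    i.e. roughly n_samples ** 2 calls; a matrix-completion approximation
    that observes only a fraction of the entries scales this linearly
    (a simplification for illustration).
    """
    return int(n_samples ** 2 * budget_fraction)

n = 1024  # samples per source sentence, as used in the paper
vanilla = utility_calls(n)        # full N x N matrix: 1,048,576 calls
pmbr = utility_calls(n, 1 / 16)   # the reported 1/16 budget: 65,536 calls
```

With 1024 samples per sentence, the full matrix costs just over a million metric calls per source sentence, which is why a 1/16 reduction matters at WMT scale.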
Researcher Affiliation | Industry | Firas Trabelsi (Google, firast@google.com); David Vilar (Google, vilar@google.com); Mara Finkelstein (Google, marafin@google.com); Markus Freitag (Google, freitag@google.com)
Pseudocode | Yes | Algorithm 1: ALS for Matrix Completion; Algorithm 2: PMBR, MBR Approximation using ALS
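For readers without access to the supplemental code, the core of Algorithm 1 can be sketched as standard alternating least squares for matrix completion: fix one factor, solve a ridge regression over the observed entries for the other, and alternate. This is a minimal NumPy sketch under the paper's hyperparameter names (λ, rank r, n steps); the actual implementation details (initialization, solver, stopping rule) are assumptions, not the authors' code:

```python
import numpy as np

def als_complete(M, mask, r=10, lam=0.1, n_steps=30, seed=0):
    """Complete a partially observed utility matrix M by ALS.

    M    : (N1, N2) matrix; only entries with mask == 1 are trusted.
    mask : (N1, N2) 0/1 array marking which utilities were computed.
    Fits a rank-r factorization M ~ U @ V.T with L2 weight lam by
    alternating regularized least-squares solves, then returns the
    dense reconstruction.
    """
    N1, N2 = M.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((N1, r)) * 0.1
    V = rng.standard_normal((N2, r)) * 0.1
    reg = lam * np.eye(r)
    for _ in range(n_steps):
        # Fix V: each row of U is a ridge regression on its observed entries.
        for i in range(N1):
            obs = mask[i].astype(bool)
            Vi = V[obs]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg, Vi.T @ M[i, obs])
        # Fix U: symmetric update for each row of V (a column of M).
        for j in range(N2):
            obs = mask[:, j].astype(bool)
            Uj = U[obs]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg, Uj.T @ M[obs, j])
    return U @ V.T
```

In the PMBR setting, `M` would be the candidate-by-pseudo-reference utility matrix, `mask` marks the budgeted subset of entries actually scored by the metric, and row means of the completed matrix give the approximate expected utilities for MBR selection.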
Open Source Code | Yes | We share the code of all the approximation methods used, including our new algorithm. We also use an open dataset, WMT22. We are working on clearing internal Google policies to release the generated data and scores. The PMBR algorithm code is submitted as supplemental material.
Open Datasets | Yes | We run experiments using the WMT 2022 test sets for English-German (en-de) and English-Russian (en-ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information.
Dataset Splits | Yes | We run experiments using the WMT 2022 test sets for English-German (en-de) and English-Russian (en-ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information. We minimize this loss per language pair on 10 examples that we hold out from the data generated with the WMT 2022 datasets with this search space...
Hardware Specification | Yes | For reference, 30 steps of the ALS algorithm with r = 10 running on a CPU take on average 0.2 seconds, while MetricX inference takes 3.4 seconds on a TPUv4 platform. Sample generation: we used around 500 TPUv5 chips for 5 hours per language pair to generate the samples. MetricX pairwise computations: we used around 2000 TPUv4 chips for 24 hours per language pair to compute all the scores. Scoring simulations: these were run on CPUs in parallel on a cluster of 1000 machines.
Software Dependencies | No | The paper mentions specific tools and models such as MetricX, COMET22, and PaLM8B, but does not provide version numbers for the software dependencies or libraries used to implement and run the experiments.
Experiment Setup | Yes | The ALS algorithm has three hyperparameters, λ, r, and n, as described in Algorithm 1. We perform a grid search to optimize these hyperparameters, setting our loss function to be the accuracy with respect to the vanilla MBR method... with the search space λ ∈ {0.1, 0.15, 0.2}, r ∈ {5, 6, ..., 15}, n ∈ {10, 11, ..., 30}. We use MetricX (Juraska et al., 2023) as the utility function for all variants of MBR decoding and sample 1024 examples for each sentence using epsilon sampling with ε = 0.02 (Freitag et al., 2023a) and 3-shot prompting with examples taken from the FLORES corpus (Guzmán et al., 2019).
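The epsilon sampling used here (Freitag et al., 2023a) truncates the model's next-token distribution to tokens with probability at least ε before sampling. A minimal per-step sketch follows; the paper applies this inside the translation model's decoder, and the greedy fallback when all tokens are pruned is an assumption of this sketch, not a detail from the paper:

```python
import numpy as np

def epsilon_sample(probs, eps=0.02, rng=None):
    """Epsilon sampling for one decoding step.

    probs : 1-D array of next-token probabilities (sums to 1).
    Tokens with probability below eps are removed, the remainder is
    renormalized, and one token index is drawn from it. If pruning
    removes everything, fall back to the argmax token (an assumed
    tie-breaking choice for this sketch).
    """
    rng = rng or np.random.default_rng()
    p = np.where(probs >= eps, probs, 0.0)
    total = p.sum()
    if total == 0.0:
        return int(np.argmax(probs))
    return int(rng.choice(len(p), p=p / total))
```

With ε = 0.02, as in the paper, low-probability tail tokens can never be sampled, which keeps the 1024 samples per sentence fluent while still diverse enough for MBR.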