Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms
Authors: Firas Trabelsi, David Vilar, Mara Finkelstein, Markus Freitag
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results on machine translation tasks demonstrate that the proposed method requires only 1/16 of the utility metric computations of vanilla MBR decoding while achieving equal translation quality, measured by COMET22, on the WMT22 dataset (en→de, en→ru). |
| Researcher Affiliation | Industry | Firas Trabelsi (Google) firast@google.com; David Vilar (Google) vilar@google.com; Mara Finkelstein (Google) marafin@google.com; Markus Freitag (Google) freitag@google.com |
| Pseudocode | Yes | Algorithm 1: ALS for Matrix Completion; Algorithm 2: PMBR, MBR Approximation using ALS. Hedged sketches of both algorithms appear below the table. |
| Open Source Code | Yes | We share the code of all the approximation methods used, including our new algorithm. We also use an open dataset, WMT22. We are working on clearing internal Google policies to release the generated data and scores. The PMBR algorithm code is submitted as supplemental material. |
| Open Datasets | Yes | We run experiments using the WMT 2022 test sets for English→German (en→de) and English→Russian (en→ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information. |
| Dataset Splits | Yes | We run experiments using the WMT 2022 test sets for English→German (en→de) and English→Russian (en→ru). The official WMT test sets (Kocmi et al., 2022) are split into sentences but come with document information. We minimize this loss per language pair on 10 examples that we hold out from the data generated with the WMT 2022 datasets, with this search space... |
| Hardware Specification | Yes | For reference, 30 steps of the ALS algorithm with r = 10 running on a CPU take on average 0.2 seconds, while MetricX inference takes 3.4 seconds on a TPUv4 platform. Sample generation: we used around 500 TPUv5 chips for 5 hours per language pair to generate the samples. MetricX pairwise computations: we used around 2,000 TPUv4 chips for 24 hours per language pair to compute all the scores. Scoring simulations: these were run on CPUs in parallel on a cluster of 1,000 machines. |
| Software Dependencies | No | The paper mentions specific tools and models such as MetricX, COMET22, and PaLM 8B, but does not provide version numbers for the software dependencies or libraries used to implement and run the experiments. |
| Experiment Setup | Yes | The ALS algorithm has three hyperparameters: λ, r, and n, as described in Algorithm 1. We perform a grid search to optimize these hyperparameters, setting our loss function to be the accuracy with respect to the vanilla MBR method... with the search space λ ∈ {0.1, 0.15, 0.2}, r ∈ {5, 6, …, 15}, n ∈ {10, 11, …, 30}. We use MetricX (Juraska et al., 2023) as the utility function for all variants of MBR decoding and sample 1024 examples for each sentence using epsilon sampling with ε = 0.02 (Freitag et al., 2023a) and 3-shot prompting with examples taken from the FLORES corpus (Guzmán et al., 2019). A sketch of epsilon sampling also appears below the table. |
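
To make Algorithm 1 concrete, here is a minimal NumPy sketch of regularized ALS matrix completion using the paper's hyperparameter names (regularization λ, rank r, iteration count n). The function name `als_complete`, the masking convention, and the random-initialization scale are our own assumptions for illustration, not the authors' released code.

```python
import numpy as np

def als_complete(M, mask, r=10, lam=0.15, n_iters=30, seed=0):
    """Complete a partially observed utility matrix with regularized ALS.

    M    : (H, R) score matrix; values outside `mask` are ignored
    mask : boolean (H, R) array, True where a score was actually computed
    r    : factorization rank (paper's grid: 5..15)
    lam  : L2 regularization weight (paper's grid: {0.1, 0.15, 0.2})
    """
    rng = np.random.default_rng(seed)
    H, R = M.shape
    U = rng.normal(scale=0.1, size=(H, r))   # hypothesis factors
    V = rng.normal(scale=0.1, size=(R, r))   # pseudo-reference factors
    reg = lam * np.eye(r)
    for _ in range(n_iters):
        # Fix V; each row of U is a ridge-regression solve over its observed entries.
        for i in range(H):
            Vo = V[mask[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + reg, Vo.T @ M[i, mask[i]])
        # Fix U; solve for each row of V symmetrically.
        for j in range(R):
            Uo = U[mask[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + reg, Uo.T @ M[mask[:, j], j])
    return U @ V.T   # completed (H, R) utility matrix
```

Each inner solve is a small r × r linear system, which is why 30 ALS steps take fractions of a second on a CPU while a single MetricX call costs seconds on a TPU.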
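Algorithm 2 (PMBR) plugs the completed matrix into ordinary MBR selection. The sketch below, reusing `als_complete` and NumPy from the previous block, scores only a random subset of (hypothesis, pseudo-reference) pairs, completes the rest with ALS, and returns the hypothesis with the highest estimated expected utility. The 1/16 budget mirrors the headline result, but the uniform-random entry sampling and the `utility_fn` interface are our assumptions rather than a transcription of the paper's algorithm.

```python
def pmbr_decode(hypotheses, references, utility_fn, budget=1 / 16, seed=0, **als_kw):
    """Approximate MBR decoding over a partially observed utility matrix.

    utility_fn(h, ref) stands in for an expensive metric call (e.g. MetricX);
    only roughly `budget` of the H x R entries are actually computed.
    """
    H, R = len(hypotheses), len(references)
    rng = np.random.default_rng(seed)
    mask = rng.random((H, R)) < budget            # entries we pay to score
    M = np.zeros((H, R))
    for i, j in zip(*np.nonzero(mask)):
        M[i, j] = utility_fn(hypotheses[i], references[j])
    M_hat = als_complete(M, mask, **als_kw)       # fill in the missing scores
    # Vanilla MBR picks argmax over the row means of the full utility matrix;
    # here we take the same argmax over the ALS-completed estimate.
    return hypotheses[int(np.argmax(M_hat.mean(axis=1)))]
```

In the paper's setup the hypothesis and pseudo-reference lists are the same 1024 samples, so the exact computation would cost 1024² metric calls; the low-rank completion is what cuts that budget by the reported factor of 16.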
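On the generation side, the setup uses epsilon sampling with ε = 0.02: at each decoding step, tokens whose probability falls below ε are discarded before renormalizing and sampling. A minimal per-step sketch, with the softmax and fallback behavior assumed by us:

```python
def epsilon_sample_step(logits, eps=0.02, rng=None):
    """Sample one token id with epsilon sampling: zero out tokens whose
    probability is below eps, renormalize, then sample from what remains."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())        # stable softmax
    probs /= probs.sum()
    probs[probs < eps] = 0.0                     # truncate the unreliable tail
    if probs.sum() == 0.0:                       # degenerate case: fall back to argmax
        return int(np.argmax(logits))
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```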