Reranking Laws for Language Generation: A Communication-Theoretic Perspective

Authors: António Farinhas, Haau-Sing Li, André Martins

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We use our framework to obtain reranking laws which we validate empirically on two real-world tasks using LLMs: text-to-code generation with DeepSeek-Coder 7B and machine translation of medical data with TowerInstruct 13B.
Researcher Affiliation | Collaboration | ¹Instituto Superior Técnico, Universidade de Lisboa; ²Instituto de Telecomunicações; ³Ubiquitous Knowledge Processing Lab, TU Darmstadt; ⁴ELLIS Unit Lisbon; ⁵Unbabel. {antonio.farinhas,andre.t.martins}@tecnico.ulisboa.pt, hli@ukp.tu-darmstadt.de
Pseudocode | No | The paper describes algorithms and models in text and mathematical formulas but does not include any dedicated pseudocode blocks or figures.
Open Source Code | Yes | Our code is available at https://github.com/deep-spin/reranking-laws.
Open Datasets | Yes | We use a sanitized version of the MBPP dataset (Austin et al., 2021; Liu et al., 2023)... We use the TICO-19 dataset (Anastasopoulos et al., 2020)... We use samples generated by Aggarwal et al. (2023) ... and report results on the SVAMP (Patel et al., 2021) and StrategyQA (Geva et al., 2021) datasets.
Dataset Splits | Yes | We split the dataset in two equally sized parts to get development and test splits. (MBPP dataset) ... We use the official splits, which contain 971 examples for development and 2100 for testing. (TICO-19 dataset) ... We split the datasets in two equally sized parts to get development and test splits. (SVAMP and StrategyQA datasets)
Hardware Specification | Yes | Our infrastructure consists of 2 machines, each equipped with 8 NVIDIA RTX A6000 GPUs (46GB) and 12 Intel Xeon Gold 6348 CPUs (2.60GHz, 1TB RAM).
Software Dependencies | No | The paper mentions "DeepSeek-Coder 7B" and "TowerInstruct 13B" (models, not general software dependencies with version numbers) and "scipy.optimize.least_squares" without a version number. It does not provide specific version numbers for the software libraries or environments required for reproducibility.
Experiment Setup | Yes | We generate 200 hypotheses with DeepSeek-Coder 7B (Guo et al., 2024) using a sampling temperature of 1... we sample 50 translation hypotheses with a temperature of 1 from TowerInstruct 13B... we fit all curves on the development set using least squares (Ghorbani et al., 2022, App. E)
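The curve-fitting step mentioned above can be sketched with `scipy.optimize.least_squares`, the routine the paper names. The power-law functional form, the synthetic development-set points, and all parameter names below are illustrative assumptions, not the exact reranking law from the paper.

```python
import numpy as np
from scipy.optimize import least_squares

def reranking_law(params, n):
    """Hypothetical reranking-law form (assumption, not the paper's exact
    formula): error floor c plus a power-law decay in the number of
    sampled hypotheses n."""
    a, b, c = params
    return c + a * n ** (-b)

def residuals(params, n, y):
    # least_squares minimizes the sum of squared residuals returned here.
    return reranking_law(params, n) - y

# Synthetic "development set" points: error rate vs. hypotheses sampled.
n = np.array([1, 2, 5, 10, 20, 50, 100, 200], dtype=float)
y_true = reranking_law([0.5, 0.7, 0.05], n)
rng = np.random.default_rng(0)
y = y_true + rng.normal(scale=1e-3, size=n.size)

# Fit the three parameters from an uninformative starting point.
fit = least_squares(residuals, x0=[1.0, 1.0, 0.0], args=(n, y))
a, b, c = fit.x
print(f"a={a:.3f}, b={b:.3f}, c={c:.3f}")
```

In practice the fitted curve would then be used to extrapolate reranking gains to larger hypothesis budgets than those measured on the development split.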