Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL
Authors: Hao Sun, Alihan Hüyük, Mihaela van der Schaar
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluations across various LLM scales and arithmetic reasoning datasets underscore both the efficacy and economic viability of the proposed approach. |
| Researcher Affiliation | Academia | Hao Sun, Alihan Hüyük, Mihaela van der Schaar — DAMTP, University of Cambridge |
| Pseudocode | No | The paper describes the steps of its proposed solution (Prompt-OIRL) in detail but does not present them in a formalized pseudocode or algorithm block format. |
| Open Source Code | Yes | Code is available at: https://github.com/vanderschaarlab/Prompt-OIRL |
| Open Datasets | Yes | Tasks: We use the tasks of MultiArith (Roy & Roth, 2016), GSM8K (Cobbe et al., 2021a), SVAMP (Patel et al., 2021) in the arithmetic reasoning domain because they are widely studied in zero-shot prompting, and hence rich expert-crafted and machine-generated prompting knowledge is available. [...] All created offline demonstration datasets, including the query-prompt pairs, prompted answers from different LLMs, and the correctness of those answers will be released as a publicly accessible dataset. |
| Dataset Splits | No | The paper specifies training and testing splits for the datasets but does not explicitly detail a separate validation set with specific sizes or percentages for hyperparameter tuning. |
| Hardware Specification | Yes | With our implementation, conducting OIRL for the GSM8K takes 50 minutes on a MacBook Air with an 8-core M2 chip, and takes only 5 minutes on a server with 16 (out of 64)-core AMD 3995WX CPUs. [...] LLaMA2-7B-chat, which operated locally on an NVIDIA A4000 GPU. |
| Software Dependencies | No | The paper mentions using "XGBoost models" and references "Chen et al., 2015", but does not specify the version number of the XGBoost library or other relevant software dependencies like Python or specific deep learning frameworks used for the experiments. |
| Experiment Setup | Yes | To enhance replicability, we use the following hyper-parameters for the gradient boosting model (Chen et al., 2015) in all experiment settings: `param = {'max_depth': 10, 'eta': 0.001, 'objective': 'binary:logistic'}` (see the sketch below this table). |
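
As a minimal sketch of how the reported hyper-parameters would be used, the snippet below trains an XGBoost binary classifier with exactly those settings. The synthetic data, feature shapes, and `num_boost_round` value are illustrative assumptions and do not come from the paper; in Prompt-OIRL the features would be derived from query-prompt pairs and the labels from answer correctness.

```python
import numpy as np
import xgboost as xgb

# Hyper-parameters reported in the paper for the gradient boosting model.
param = {"max_depth": 10, "eta": 0.001, "objective": "binary:logistic"}

# ASSUMPTION: stand-in data. In the paper, each row would encode a
# (query, prompt) pair and the label would be whether the prompted
# LLM answer was correct (1) or not (0).
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1024, 64))
y = rng.integers(0, 2, size=1024)

dtrain = xgb.DMatrix(X, label=y)
# num_boost_round is an assumption; the paper does not report it.
booster = xgb.train(param, dtrain, num_boost_round=100)

# With the binary:logistic objective, predictions are the model's
# probability that a candidate prompt yields a correct answer.
scores = booster.predict(xgb.DMatrix(X[:5]))
print(scores)
```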