Aligning LLM Agents by Learning Latent Preference from User Edits

Authors: Ge Gao, Alexey Taymanov, Eduardo Salinas, Paul Mineiro, Dipendra Misra

NeurIPS 2024

Reproducibility assessment. Each entry lists the variable, the result, and the supporting LLM response:
Research Type: Experimental. "We introduce two interactive environments, summarization and email writing, and use a GPT-4-simulated user for evaluation. On both tasks, CIPHER outperforms several baselines, achieving the lowest edit distance cost while incurring only a small overhead in LLM query cost over the base agent. Our analysis reports that the user preferences learned by CIPHER show significant similarity to the ground-truth latent preferences."
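The edit distance cost used as the headline metric above can be made concrete. The sketch below computes a token-level Levenshtein distance between the agent's draft and the user's edited version; the function name and the use of plain whitespace tokenization are assumptions for illustration, not the paper's exact implementation (the paper tokenizes with Tiktoken).

```python
def edit_distance(source_tokens, target_tokens):
    """Token-level Levenshtein distance between an agent draft and a user edit.

    Counts the minimum number of single-token insertions, deletions, and
    substitutions needed to turn source_tokens into target_tokens.
    """
    m, n = len(source_tokens), len(target_tokens)
    # dp[j] holds the distance between the current source prefix and target_tokens[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev is the diagonal (i-1, j-1) cell
        for j in range(1, n + 1):
            cur = dp[j]
            if source_tokens[i - 1] == target_tokens[j - 1]:
                dp[j] = prev  # tokens match: no edit needed
            else:
                # substitution (diag), deletion (up), or insertion (left)
                dp[j] = 1 + min(prev, dp[j], dp[j - 1])
            prev = cur
    return dp[n]
```

A lower cumulative cost over rounds means the simulated user had to edit the agent's responses less, which is how the metric rewards preference learning.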
Researcher Affiliation: Collaboration. Ge Gao (Department of Computer Science, Cornell University); Alexey Taymanov, Eduardo Salinas, Paul Mineiro, and Dipendra Misra (Microsoft Research New York).
Pseudocode: Yes. "Algorithm 1 CIPHER(ϕ, k, δ). A context representation function ϕ : X → R^d, the retrieval hyperparameter k, and tolerance hyperparameter δ ≥ 0. We initialize history D = ∅."
Open Source Code: Yes. "Our code and data are publicly available at https://github.com/gao-g/prelude."
Open Datasets: Yes. "We use documents from several existing sources listed in Table 1. These sources represent a diverse category of documents that a writing assistant would typically encounter (see Table 4 in Appendix for examples). In any given round, the user is provided a context that is a document from one of the sources for the given task." Table 4: "Link to each source dataset, from which we randomly sample examples as the user-provided context in our tasks. CNN Daily Mail (See et al., 2017): https://huggingface.co/datasets/cnn_dailymail"
Dataset Splits: No. "Notably, there is no distinction between training and testing in our setting, as every natural use of the agent yields edit feedback for learning."
Hardware Specification: No. The paper states that GPT-4 is used as the base LLM but does not provide specific hardware details, such as GPU/CPU models or memory, for the experiments.
Software Dependencies: No. "We use GPT-4 as our base LLM for CIPHER and all baselines. We use the Tiktoken tokenizer. We experiment with MPNET (Song et al., 2020) and BERT (Devlin et al., 2019) as our two context representation functions ϕ." No specific version numbers for software libraries or dependencies are provided.
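MPNET and BERT embeddings require model downloads, but the interface they fill, ϕ : X → R^d, is simple to illustrate. The dependency-free stand-in below hashes a bag of words into a normalized d-dimensional vector; it is an illustration of the interface only, not a substitute for the sentence encoders the paper uses.

```python
def phi_hashed(text, d=64):
    """Toy context representation phi: X -> R^d via hashed bag-of-words.

    A dependency-free stand-in for the MPNET/BERT encoders used in the paper;
    real sentence embeddings capture semantics, this only captures word overlap.
    """
    vec = [0.0] * d
    for token in text.lower().split():
        vec[hash(token) % d] += 1.0
    norm = sum(x * x for x in vec) ** 0.5
    # L2-normalize so downstream cosine similarity reduces to a dot product
    return [x / norm for x in vec] if norm else vec
```

Any function with this shape (text in, fixed-length unit vector out) can be dropped into the retrieval step, which is why the paper can swap MPNET and BERT without changing the algorithm.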
Experiment Setup: Yes. "For both tasks, we run an experiment for T = 200 rounds. We experiment with two different values of the number of retrieved examples, k ∈ {1, 5}. For E-then-e LPI and Continual LPI we set Te = 5. We provide the GPT-4 user prompt template and user edit examples in Appendix B. Prompt templates used by CIPHER are provided in Table 7."
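The hyperparameters quoted above can be collected in one place. The sketch below is ours, not the paper's code: the field names are assumptions, the values of T, k, and Te come from the quoted setup, and the δ default is only constrained to δ ≥ 0 by Algorithm 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters for one experimental run (field names are ours)."""
    T: int = 200        # interaction rounds per experiment
    k: int = 5          # number of retrieved examples; the paper tries k in {1, 5}
    Te: int = 5         # exploration rounds for E-then-e LPI and Continual LPI
    delta: float = 0.0  # tolerance hyperparameter from Algorithm 1 (delta >= 0)


# One config per value of k evaluated in the paper
configs = [ExperimentConfig(k=k) for k in (1, 5)]
```

Freezing the dataclass keeps a run's hyperparameters immutable, so the two k settings can be compared without accidental in-place changes.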