Aligning LLM Agents by Learning Latent Preference from User Edits
Authors: Ge Gao, Alexey Taymanov, Eduardo Salinas, Paul Mineiro, Dipendra Misra
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce two interactive environments, summarization and email writing, and use a GPT-4 simulated user for evaluation. On both tasks, CIPHER outperforms several baselines by achieving the lowest edit distance cost while incurring only a small overhead in LLM query cost over the base agent. Our analysis shows that the user preferences learned by CIPHER are significantly similar to the ground-truth latent preferences. |
| Researcher Affiliation | Collaboration | Ge Gao, Alexey Taymanov, Eduardo Salinas, Paul Mineiro, Dipendra Misra. Department of Computer Science, Cornell University; Microsoft Research New York. |
| Pseudocode | Yes | Algorithm 1 CIPHER(ϕ, k, δ). A context representation function ϕ : X → R^d, the retrieval hyperparameter k, and tolerance hyperparameter δ ≥ 0. We initialize history D = ∅. |
| Open Source Code | Yes | Our code and data are publicly available at https://github.com/gao-g/prelude. |
| Open Datasets | Yes | We use documents from several existing sources listed in Table 1. These sources represent a diverse category of documents that a writing assistant would typically encounter (see Table 4 in Appendix for examples). In any given round, the user is provided a context that is a document from one of the sources for the given task. Table 4: Link to each source dataset, from which we randomly sample examples as the user-provided context in our tasks. CNN Daily Mail (See et al., 2017) https://huggingface.co/datasets/cnn_dailymail |
| Dataset Splits | No | Notably, there is no distinction between training and testing in our setting as every natural use of the agent yields an edit feedback for learning. |
| Hardware Specification | No | The paper states that GPT-4 is used as the base LLM but does not provide hardware details such as the GPU/CPU models or memory used for the experiments. |
| Software Dependencies | No | We use GPT-4 as our base LLM for CIPHER and all baselines. We use the Tiktoken tokenizer. We experiment with MPNET (Song et al., 2020) and BERT (Devlin et al., 2019) as our two context representation functions ϕ. No version numbers for software libraries or dependencies are provided. |
| Experiment Setup | Yes | For both tasks, we run an experiment for T = 200 rounds. We experiment with two different values of the number of retrieved examples, k ∈ {1, 5}. For E-then-e LPI and Continual LPI we set Te = 5. We provide the GPT-4 user prompt template and user edit examples in Appendix B. Prompt templates used by CIPHER are provided in Table 7. |
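The Algorithm 1 row above quotes CIPHER's interface: a context representation ϕ, a retrieval hyperparameter k, and a history D that starts empty. The retrieval-and-update loop it implies can be sketched as below. This is a minimal illustration, not the paper's implementation: the word-overlap "embedding" stands in for the MPNET/BERT representation functions, and the stored preference strings stand in for preferences that the paper infers with GPT-4 from user edits.

```python
def embed(text):
    # Toy stand-in for the paper's ϕ (MPNET/BERT): a bag of lowercase words.
    return set(text.lower().split())

def similarity(a, b):
    # Jaccard overlap between two word sets (stand-in for cosine similarity).
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class CipherSketch:
    """Minimal sketch of CIPHER's history-based retrieval loop."""

    def __init__(self, k):
        self.k = k
        self.history = []  # D: list of (context embedding, inferred preference)

    def retrieve(self, context):
        # Return the preferences of the k most similar past contexts.
        q = embed(context)
        ranked = sorted(self.history,
                        key=lambda item: similarity(q, item[0]),
                        reverse=True)
        return [pref for _, pref in ranked[: self.k]]

    def update(self, context, inferred_preference):
        # After the user edits a response, store the preference inferred
        # from that edit (a GPT-4 call in the paper) keyed by the context.
        self.history.append((embed(context), inferred_preference))

agent = CipherSketch(k=2)
agent.update("stock market news", "bullet points")
agent.update("movie review", "casual tone")
agent.update("stocks and bonds report", "bullet points")
prefs = agent.retrieve("market report on stocks")
# Both nearest contexts are finance documents, so the retrieved
# preferences agree: ["bullet points", "bullet points"]
```

In the paper, the retrieved preferences are aggregated by the LLM into a single preference that conditions the next response; here the retrieval step alone is shown.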
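The Research Type row above states that CIPHER is evaluated by edit distance cost between the agent's response and the user's edited version. A character-level Levenshtein distance, shown below, is the standard form of this metric; the paper tokenizes with Tiktoken, so its exact cost is computed over tokens rather than characters.

```python
def edit_distance(a, b):
    # Levenshtein distance via the classic dynamic program,
    # kept to a single rolling row for O(len(b)) memory.
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete a[i-1]
                        dp[j - 1] + 1,      # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute/match
            prev = cur
    return dp[-1]

# A user edit that substitutes, inserts, and appends characters:
print(edit_distance("kitten", "sitting"))  # → 3
```

A zero cost means the user accepted the response verbatim; lower average cost over the T = 200 rounds indicates better-aligned responses.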