ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Authors: Jingnan Zheng, Han Wang, An Zhang, Nguyen Duy Tai, Jun Sun, Tat-Seng Chua
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across three aspects of human values (stereotypes, morality, and legality) demonstrate that ALI-Agent, as a general evaluation framework, effectively identifies model misalignment. Systematic analysis also validates that the generated test scenarios represent meaningful use cases and integrate enhanced measures to probe long-tail risks. |
| Researcher Affiliation | Academia | Jingnan Zheng, National University of Singapore (jingnan.zheng@u.nus.edu); Han Wang, University of Illinois Urbana-Champaign (hanw14@illinois.edu); An Zhang, National University of Singapore (anzhang@u.nus.edu); Tai D. Nguyen, Singapore Management University (dtnguyen.2019@smu.edu.sg); Jun Sun, Singapore Management University (junsun@smu.edu.sg); Tat-Seng Chua, National University of Singapore (dcscts@nus.edu.sg) |
| Pseudocode | Yes | The framework is depicted in Figure 2, with the detailed workflow illustrated in Algorithm 1 and a comprehensive example provided in Figure 3. Algorithm 1: ALI-Agent. (An illustrative sketch of this workflow is given after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/SophieZheng998/ALI-Agent.git. |
| Open Datasets | Yes | To verify ALI-Agent's effectiveness as a general evaluation framework, we conduct experiments on six datasets from three distinct aspects of human values: stereotypes (DecodingTrust [11], CrowS-Pairs [2]), morality (ETHICS [3], Social Chemistry 101 [37]), and legality (Singapore Rapid Transit Systems Regulations, AdvBench [38]), where five of them follow prevailing evaluation benchmarks, and Singapore Rapid Transit Systems Regulations is a body of laws collected online [39]. Appendix D.1 provides detailed descriptions of the datasets. |
| Dataset Splits | Yes | The training data comprises 90% of the labeled data, with the remaining 10% used for validation. |
| Hardware Specification | Yes | For proprietary target LLMs, we employed a single NVIDIA RTX A5000 to run training and testing. For open-source models, we employed 8 Tesla V100-SXM2-32GB-LS to meet the requirements (Llama2 70B is the largest open-source model we have evaluated). ... For fine-tuning Llama 2 as evaluators, we employed 4 NVIDIA RTX A5000 for about 5 hours. |
| Software Dependencies | No | The paper mentions models like GPT-4-1106-preview and Llama 2-7B, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Training is conducted for 15 epochs using a batch size of 16, a learning rate of 1e-5 with linear decay to 0, a weight decay of 0.1, and a maximum sequence length of 512. (An illustrative mapping of these hyperparameters to a training configuration is given after the table.) |
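
The following is a minimal, hypothetical sketch of the agent-based evaluation episode referenced in the Pseudocode row (Algorithm 1). It assumes a generate-then-refine workflow in which a core LLM emulates a realistic test scenario from a seed misconduct, the target LLM responds, a fine-tuned evaluator judges misalignment, and past records are kept in memory as in-context examples; all class, function, and method names here are illustrative assumptions, not the authors' API.

```python
# Hypothetical sketch of an ALI-Agent-style evaluation episode; the methods on
# core_llm, target_llm, and evaluator are illustrative stand-ins for prompt calls.
from dataclasses import dataclass


@dataclass
class EvaluationRecord:
    misconduct: str   # seed misconduct drawn from a benchmark dataset
    scenario: str     # realistic test scenario emulated by the agent
    response: str     # target LLM's response to the scenario
    misaligned: bool  # evaluator's judgement of the response


def evaluate_one(misconduct, core_llm, target_llm, evaluator, memory, max_refinements=3):
    """Emulate a test scenario, then iteratively refine it to probe long-tail
    risks until misalignment is exposed or the refinement budget is exhausted."""
    # Emulation: wrap the raw misconduct in a realistic scenario, conditioning
    # on a few past evaluation records retrieved from memory.
    scenario = core_llm.generate_scenario(misconduct, examples=memory[-5:])

    record = None
    for _ in range(max_refinements):
        response = target_llm.respond(scenario)
        misaligned = evaluator.judge(scenario, response)
        record = EvaluationRecord(misconduct, scenario, response, misaligned)
        if misaligned:
            break
        # Refinement: make the scenario subtler so it can surface long-tail risks.
        scenario = core_llm.refine_scenario(scenario, response)

    memory.append(record)  # persist the outcome as an example for later episodes
    return record
```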
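
For the Experiment Setup row, the reported evaluator fine-tuning hyperparameters could be expressed, for example, as a Hugging Face `TrainingArguments` configuration; this is a hedged sketch only, since the paper does not state which training framework was used, and the output path is illustrative.

```python
# Hypothetical mapping of the reported fine-tuning hyperparameters onto
# Hugging Face TrainingArguments; the actual trainer used is not specified.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="llama2-evaluator",   # illustrative output path
    num_train_epochs=15,             # 15 epochs
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=1e-5,              # peak learning rate 1e-5
    lr_scheduler_type="linear",      # linear decay to 0
    weight_decay=0.1,                # weight decay 0.1
)
# The maximum sequence length of 512 would be enforced at tokenization time,
# e.g. tokenizer(text, truncation=True, max_length=512).
```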