Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation
Authors: Jingnan Zheng, Han Wang, An Zhang, Nguyen Duy Tai, Jun Sun, Tat-Seng Chua
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across three aspects of human values (stereotypes, morality, and legality) demonstrate that ALI-Agent, as a general evaluation framework, effectively identifies model misalignment. Systematic analysis also validates that the generated test scenarios represent meaningful use cases, as well as integrate enhanced measures to probe long-tail risks. |
| Researcher Affiliation | Academia | Jingnan Zheng (National University of Singapore); Han Wang (University of Illinois Urbana-Champaign); An Zhang (National University of Singapore); Tai D. Nguyen (Singapore Management University); Jun Sun (Singapore Management University); Tat-Seng Chua (National University of Singapore) |
| Pseudocode | Yes | The framework is depicted in Figure 2, with the detailed workflow illustrated in Algorithm 1 and a comprehensive example provided in Figure 3. Algorithm 1 ALI-Agent |
| Open Source Code | Yes | Our code is available at https://github.com/SophieZheng998/ALI-Agent.git. |
| Open Datasets | Yes | To verify ALI-Agent's effectiveness as a general evaluation framework, we conduct experiments on six datasets from three distinct aspects of human values: stereotypes (DecodingTrust [11], CrowS-Pairs [2]), morality (ETHICS [3], Social Chemistry 101 [37]), and legality (Singapore Rapid Transit Systems Regulations, AdvBench [38]), where five of them follow prevailing evaluation benchmarks, and Singapore Rapid Transit Systems Regulations is a body of laws collected online [39]. Appendix D.1 provides detailed descriptions of the datasets. |
| Dataset Splits | Yes | The training data comprises 90% of the labeled data, with the remaining 10% used for validation. |
| Hardware Specification | Yes | For proprietary target LLMs, we employed a single NVIDIA RTX A5000 to run training and testing. For open-source models, we employed 8 Tesla V100-SXM2-32GB-LS to meet the requirements (Llama2 70B is the largest open-source model we have evaluated). ... For fine-tuning Llama 2 as evaluators, we employed 4 NVIDIA RTX A5000 for about 5 hours. |
| Software Dependencies | No | The paper mentions models like GPT-4-1106-preview and Llama 2-7B, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Training is conducted for 15 epochs using a batch size of 16, a learning rate of 1e-5 with linear decay to 0, a weight decay of 0.1, and a maximum sequence length of 512. |
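
The split and training settings quoted in the Dataset Splits and Experiment Setup rows can be written out as a small configuration sketch. This is an illustration only, not the authors' code: the function names, the random shuffle with a fixed seed, the absence of a warmup phase, and the rounding of partial batches are all assumptions, since the quoted text gives only the numeric values.

```python
import random

# Values quoted in the table above.
EPOCHS = 15
BATCH_SIZE = 16
PEAK_LR = 1e-5
WEIGHT_DECAY = 0.1
MAX_SEQ_LEN = 512
TRAIN_FRACTION = 0.9  # 90% training, 10% validation


def split_train_val(examples, train_fraction=TRAIN_FRACTION, seed=0):
    """Shuffle and split labeled examples (hypothetical helper; the seed is an assumption)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def total_training_steps(num_train_examples, batch_size=BATCH_SIZE, epochs=EPOCHS):
    """Number of optimizer steps, assuming the last partial batch is kept."""
    steps_per_epoch = -(-num_train_examples // batch_size)  # ceiling division
    return steps_per_epoch * epochs


def linear_decay_lr(step, total_steps, peak_lr=PEAK_LR):
    """Learning rate that falls linearly from peak_lr at step 0 to 0 at total_steps."""
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / total_steps
```

For a hypothetical set of 1,000 labeled examples, `split_train_val(range(1000))` returns 900 training and 100 validation items, and `total_training_steps(900)` gives the step count over which `linear_decay_lr` runs from the peak rate down to zero.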