Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Position: Towards Unified Alignment Between Agents, Humans, and Environment

Authors: Zonghan Yang, An Liu, Zijun Liu, Kaiming Liu, Fangzhou Xiong, Yile Wang, Zeyuan Yang, Qingyuan Hu, Xinrui Chen, Zhenhe Zhang, Fuwen Luo, Zhicheng Guo, Peng Li, Yang Liu

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We also conduct proof-of-concept studies by introducing realistic features to WebShop (Yao et al., 2022a)... We then follow the principles of UA2 to propose an initial design of our agent and benchmark its performance with several candidate baselines in the retrofitted WebShop. The extensive experimental results further prove the importance of the principles of UA2.
Researcher Affiliation	Collaboration	(1) Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; (2) Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China; (3) Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China.
Pseudocode	Yes	The algorithms of collaborative filtering and DPP-based reranking are briefed in Algorithms 1 and 2, respectively.
Open Source Code	Yes	The code implementation of the retrofitted WebShop environment can be found at https://github.com/AgentForceTeamOfficial/UA2-WebShop . The code implementation of our agent design can be found at https://github.com/AgentForceTeamOfficial/UA2-Agent .
Open Datasets	Yes	WebShop is a simulated online shopping environment with 1.18M real-world shopping items gathered from Amazon, and 12,087 textual shopping instructions collected from human annotators.
Dataset Splits	Yes	We evaluate our method and baseline methods across all 10 users on our retrofitted Webshop, each comprising 50 tasks, except for LATS which is evaluated with only one user due to its high cost. All methods were tested in each of the following three environments respectively: The fully retrofitted environment... The ablated environment that excludes human intentions... The ablated environment that excludes environmental dynamics...
Hardware Specification	No	No specific hardware details (like GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided.
Software Dependencies	Yes	We employed ChatGPT (gpt-3.5-turbo-1106) as an assistant to simulate 30 different users and gather their preference data... All methods utilize gpt-3.5-turbo-instruct-0914 as the underlying model for their agents except for LATS where we utilize gpt-3.5-turbo-1106 to keep the same setting as the original paper.
Experiment Setup	Yes	In executing each task, we limited the interaction with the web to a maximum of 15 steps per task, inclusive of any invalid actions. For ReAct, Reflexion, and our method, we set the temperature as 0.0. For ReAct-SC, we set the number of samples k to be 3 and the temperature to be 0.05... For CoT and CoT-L2M, we also set the temperature as 0.0; and for CoT-SC, we also set k = 3 and the temperature to be 0.05. To adhere to the same settings with (Zhou et al., 2023a), we set the temperature to be 1.0, k to be 5, the number of iterations n to be 30 for LATS.
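
For readers who want to compare the reported settings at a glance, the values quoted in the Software Dependencies and Experiment Setup rows can be collected into a small lookup table. This is a minimal sketch, not the authors' configuration format: the `MethodConfig` class, the `SETTINGS` dictionary, and the key `"ours"` for the paper's own agent are hypothetical names introduced here for illustration. Only the numeric values and model identifiers come from the quoted text.

```python
from dataclasses import dataclass
from typing import Optional

# Maximum web interactions per task, counting invalid actions (quoted value).
MAX_STEPS_PER_TASK = 15

# Model identifiers as quoted in the Software Dependencies row.
DEFAULT_MODEL = "gpt-3.5-turbo-instruct-0914"
LATS_MODEL = "gpt-3.5-turbo-1106"


@dataclass(frozen=True)
class MethodConfig:
    """Hypothetical container for one method's reported settings."""
    model: str
    temperature: float
    k: Optional[int] = None             # number of samples, where reported
    n_iterations: Optional[int] = None  # reported for LATS only


# Keys are illustrative labels; "ours" stands for the paper's own agent.
SETTINGS = {
    "ReAct": MethodConfig(DEFAULT_MODEL, 0.0),
    "Reflexion": MethodConfig(DEFAULT_MODEL, 0.0),
    "ours": MethodConfig(DEFAULT_MODEL, 0.0),
    "ReAct-SC": MethodConfig(DEFAULT_MODEL, 0.05, k=3),
    "CoT": MethodConfig(DEFAULT_MODEL, 0.0),
    "CoT-L2M": MethodConfig(DEFAULT_MODEL, 0.0),
    "CoT-SC": MethodConfig(DEFAULT_MODEL, 0.05, k=3),
    "LATS": MethodConfig(LATS_MODEL, 1.0, k=5, n_iterations=30),
}

if __name__ == "__main__":
    for name, cfg in SETTINGS.items():
        print(f"{name}: {cfg}")
```

Settings the quoted text does not report for a method (for example, k for ReAct) are left as `None` rather than guessed.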