Identifying the Risks of LM Agents with an LM-Emulated Sandbox
Authors: Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes toolkits and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. |
| Researcher Affiliation | Academia | Yangjun Ruan1,2, Honghua Dong1,2, Andrew Wang1,2, Silviu Pitis1,2, Yongchao Zhou1,2, Jimmy Ba1,2, Yann Dubois3, Chris J. Maddison1,2, Tatsunori Hashimoto3 — 1University of Toronto, 2Vector Institute, 3Stanford University |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It includes flowcharts and detailed prompts in the appendix, but not formal pseudocode. |
| Open Source Code | Yes | Our code is included in the supplementary material and will be open-sourced upon acceptance. |
| Open Datasets | No | The paper mentions 'curating an initial benchmark consisting of 36 high-stakes toolkits and 144 test cases' and 'Our final curated dataset consists of 144 test cases spanning 9 risk types', but does not provide concrete access information (link, DOI, specific repository, or formal citation with authors/year) for this dataset. |
| Dataset Splits | No | The paper states 'We randomly sampled a subset of 100 test cases from our curated dataset for validation' and that the 'final curated dataset consists of 144 test cases', but does not provide specific train/validation/test dataset split information (percentages, sample counts, or citations to predefined splits) needed for reproducibility. |
| Hardware Specification | No | The main experimental setup sections (4.1 and 5) do not provide specific hardware details (exact GPU/CPU models, memory amounts, or detailed computer specifications) used for running the experiments. While Appendix G.3 mentions an 'Amazon EC2 instance (t2.large) with 2 vCPUs and 8GiB memory' for a specific real sandbox instantiation, this is not the general hardware specification for all reported experiments. |
| Software Dependencies | No | The paper mentions using specific LM models (e.g., 'GPT-4 (OpenAI, 2023a) (gpt-4-0613)', 'ChatGPT-3.5 (OpenAI, 2022) (gpt-3.5-turbo-16k-0613)', 'Claude-2 (Anthropic, 2023)', 'Vicuna-1.5 (Chiang et al., 2023) (vicuna-13b/7b-v1.5-16k)') and that 'The LM agent was implemented by ReAct (Yao et al., 2023b)', but it does not specify versions of general ancillary software or libraries (e.g., Python, PyTorch, TensorFlow) needed for replication. |
| Experiment Setup | Yes | For all emulators and evaluators, we employed GPT-4 with temperature=0. For better reproducibility, we used temperature=0.0 for all the components, including the agents, emulators, and evaluators. The LM agent was implemented with ReAct (Yao et al., 2023b) and prompted with additional formatting instructions and examples (see Appx. E for details). We studied the effect of prompting by incorporating certain safety requirements into the GPT-4 prompt (denoted as Safety), such as requiring the LM agent to be aware of potential risks and to seek user confirmation before executing risky actions (see Table E.1). |
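
The quoted setup boils down to routing every component (agent, emulator, evaluator) through the same deterministic GPT-4 snapshot. Below is a minimal sketch of that configuration, assuming the OpenAI Python client (v1.x) and the `gpt-4-0613` model named in the paper; the prompt strings are hypothetical placeholders, not the paper's actual prompts (those appear in its appendices).

```python
# Minimal sketch (not the authors' code): all components share one GPT-4 call
# path with temperature=0.0 for reproducibility, as described in the setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def call_gpt4(system_prompt: str, user_prompt: str) -> str:
    """Single deterministic GPT-4 call used by agent, emulator, and evaluator."""
    response = client.chat.completions.create(
        model="gpt-4-0613",   # GPT-4 snapshot named in the paper
        temperature=0.0,      # deterministic decoding for all components
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content


# Placeholder system prompts; the paper's real prompts are far more detailed.
agent_step = call_gpt4(
    "You are a ReAct-style tool-using agent.",
    "User instruction: ...",
)
emulated_observation = call_gpt4(
    "You emulate the execution of the tool the agent just called.",
    f"Agent action: {agent_step}",
)
risk_evaluation = call_gpt4(
    "You evaluate the agent trajectory for potential safety risks.",
    f"Trajectory:\n{agent_step}\n{emulated_observation}",
)
```

With temperature fixed at 0.0, repeated runs of the same test case should yield near-identical trajectories, which is what makes the paper's quantitative risk analysis comparable across agents.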