Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments
Authors: Qinhong Zhou, Sunli Chen, Yisong Wang, Haozhe Xu, Weihua Du, Hongxin Zhang, Yilun Du, Joshua B. Tenenbaum, Chuang Gan
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS, 5.1 EXPERIMENTAL SETUP, 5.2 BASELINES, 5.3 EXPERIMENTAL RESULTS, Table 1: The rescued value rate (Value), averaged rescue step (Step), and averaged damaged rate (Damage) of the proposed LLM pipeline (LLM) and all baseline methods. |
| Researcher Affiliation | Collaboration | Qinhong Zhou¹, Sunli Chen², Yisong Wang³, Haozhe Xu³, Weihua Du², Hongxin Zhang¹, Yilun Du⁴, Joshua B. Tenenbaum⁴, Chuang Gan¹,⁵; ¹University of Massachusetts Amherst, ²Institute for Interdisciplinary Information Sciences, Tsinghua University, ³Peking University, ⁴MIT, ⁵MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper describes algorithms like A* and MCTS within the text, but it does not provide any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | For readers interested in reproducing the experimental results presented in this paper, we have made our experiments accessible via a Github repository, available at https://github.com/UMass-Foundation-Model/HAZARD. |
| Open Datasets | Yes | HAZARD is available at https://vis-www.cs.umass.edu/hazard/ and To create the dataset for HAZARD, we choose 4 distinct indoor rooms for the fire and flood tasks, and 4 outdoor regions for the wind task. |
| Dataset Splits | No | The paper states a 'train-set split ratio of 3:1' but does not explicitly mention a separate validation split or its details. |
| Hardware Specification | Yes | We run most of our experiments on an Intel i9-9900k CPU and RTX2080-Super GPU Desktop. |
| Software Dependencies | No | The paper mentions 'Open MMLab detection framework' and 'Mask-RCNN' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use max tokens of 512, temperature of 0.7, top p of 1.0 as hyper-parameters during inference. and We use the PPO algorithm with learning rate 2.5 × 10⁻⁴ and train for 10⁵ steps. |
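
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration record, which may help when comparing a reproduction run against the reported settings. The following is a minimal sketch, not the authors' code: the class and field names (`LLMInferenceConfig`, `PPOTrainingConfig`, `as_record`) are hypothetical, and it assumes the learning rate reads as 2.5e-4 and the training length as 10^5 steps.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LLMInferenceConfig:
    """Sampling settings quoted for LLM inference (field names are hypothetical)."""
    max_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 1.0


@dataclass(frozen=True)
class PPOTrainingConfig:
    """PPO settings quoted for the RL baseline (field names are hypothetical)."""
    # Assumption: the reported learning rate is 2.5 x 10^-4.
    learning_rate: float = 2.5e-4
    # Assumption: the reported training length is 10^5 steps.
    total_steps: int = 10**5


def as_record() -> dict:
    """Return both configurations as one plain dictionary for logging or diffing."""
    return {
        "llm_inference": asdict(LLMInferenceConfig()),
        "ppo_training": asdict(PPOTrainingConfig()),
    }


if __name__ == "__main__":
    print(as_record())
```

Freezing the dataclasses keeps the recorded values from being changed by accident during a run. Settings the row does not quote, such as batch size or random seed, are deliberately left out rather than guessed.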