GUIDE: Real-Time Human-Shaped Agents
Authors: Lingyu Zhang, Zhengran Ji, Nicholas Waytowich, Boyuan Chen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our human study involving 50 subjects offers strong quantitative and qualitative evidence of the effectiveness of our approach. |
| Researcher Affiliation | Collaboration | Lingyu Zhang¹, Zhengran Ji¹, Nicholas R. Waytowich², Boyuan Chen¹ (¹Duke University, ²Army Research Laboratory) |
| Pseudocode | No | The paper describes the algorithms and framework but does not include a formal pseudocode block or an explicitly labeled "Algorithm" section. |
| Open Source Code | Yes | We will also open-source the entire code base, including algorithms and task environments for the broader community for full reproducibility. |
| Open Datasets | Yes | We conduct our experiments on the CREW [51] platform. ... We will also open-source the entire code base, including algorithms and task environments for the broader community for full reproducibility. |
| Dataset Splits | Yes | To prevent overfitting, we held out 1 out of 5 trajectories as a validation set. |
| Hardware Specification | Yes | All human subject experiments are conducted on desktops with one NVIDIA RTX 4080 GPU. All evaluations are run on a headless server with 8 NVIDIA RTX A6000 and NVIDIA RTX 3090 Ti. |
| Software Dependencies | No | The paper mentions software components like "Adam optimizer", "DDPG", and "SAC" but does not specify version numbers for general software libraries or programming languages (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We used an Adam optimizer with a fixed learning rate of 1e-4 for RL policy training, with a discount factor of γ = 0.99. We applied gradient clipping, setting the maximum gradient norm to 1. For the learned feedback model, we used the same Adam optimizer with a 1e-4 learning rate and employed early stopping based on the loss on held-out trajectories. For Deep TAMER's credit assignment window, we used the same uniform [0.2, 4] distribution as in the original paper. We used a shorter window of [0.2, 1] for Find Treasure and Hide-and-Seek. For these more difficult navigation tasks, we stacked three consecutive frames as input. |
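
The experiment-setup and dataset-split rows above describe the reported training configuration (Adam at 1e-4, γ = 0.99, max gradient norm 1, early stopping on a held-out trajectory, a uniform credit-assignment window, and 3-frame stacking). The sketch below collects those quoted settings into PyTorch-style code. It is only an illustration under stated assumptions: `PolicyNet`, the observation/action sizes, and the `update` helper are hypothetical placeholders, since the paper's actual network architecture, feedback model, and loss are not given in this table.

```python
# Minimal sketch of the hyperparameters quoted in the table above.
# Only GAMMA, LR, MAX_GRAD_NORM, the credit-assignment windows, and
# FRAME_STACK come from the paper; everything else is a placeholder.
import random
from collections import deque

import torch
import torch.nn as nn

GAMMA = 0.99                     # discount factor reported in the paper
LR = 1e-4                        # fixed Adam learning rate for RL policy training
MAX_GRAD_NORM = 1.0              # gradient clipping threshold
CREDIT_WINDOW = (0.2, 4.0)       # Deep TAMER credit-assignment window (seconds)
NAV_CREDIT_WINDOW = (0.2, 1.0)   # shorter window for Find Treasure / Hide-and-Seek
FRAME_STACK = 3                  # consecutive frames stacked for navigation tasks


class PolicyNet(nn.Module):
    """Hypothetical stand-in for the actual GUIDE policy network."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, act_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)


def sample_credit_delay(navigation_task: bool = False) -> float:
    """Draw a credit-assignment delay from the uniform window used for Deep TAMER."""
    low, high = NAV_CREDIT_WINDOW if navigation_task else CREDIT_WINDOW
    return random.uniform(low, high)


def stack_frames(frames: deque) -> torch.Tensor:
    """Concatenate the last FRAME_STACK observations along the feature axis."""
    return torch.cat(list(frames), dim=-1)


# Optimizer and gradient clipping exactly as quoted; the RL loss itself
# (DDPG/SAC-style with human feedback) and the early-stopping loop over the
# held-out trajectory (1 of 5) are omitted here.
obs_dim, act_dim = 30, 4  # illustrative sizes only
policy = PolicyNet(obs_dim * FRAME_STACK, act_dim)
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)


def update(loss: torch.Tensor) -> None:
    """One gradient step with the reported clipping threshold."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), MAX_GRAD_NORM)
    optimizer.step()
```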