Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CREW: Facilitating Human-AI Teaming Research

Authors: Lingyu Zhang, Zhengran Ji, Boyuan Chen

TMLR 2024

Reproducibility Variable — Result — LLM Response
Research Type — Experimental
"With CREW, we were able to conduct 50 human subject studies within a week to verify the effectiveness of our benchmark. We demonstrate CREW's potential by benchmarking real-time human-guided reinforcement learning (RL) algorithms alongside various RL baselines."
Researcher Affiliation — Academia
"Lingyu Zhang (Duke University), Zhengran Ji (Duke University), Boyuan Chen (Duke University)"
Pseudocode — Yes
"Algorithm 1: The c-Deep TAMER algorithm."
Open Source Code — Yes
"Our fully open-sourced code base and detailed documentation can be found at https://github.com/generalroboticslab/CREW.git."
Open Datasets — No
"The paper describes a platform for conducting human-AI teaming research and collecting data, but it does not provide concrete access information (link, DOI, repository, or formal citation with authors/year) for any specific publicly available datasets used or generated by their experiments."
Dataset Splits — No
"The paper describes evaluation procedures for checkpoints, such as evaluating for '1 game (10 rolls)' or '100 episodes' on 'unseen test environments', but it does not provide specific dataset split information (percentages, sample counts, or detailed methodology) for splitting a larger dataset into training, validation, and test sets."
Hardware Specification — Yes
"All human subject experiments were conducted on desktops with one NVIDIA RTX 4080 GPU. All evaluations were run on a headless server with 8 NVIDIA RTX A6000 and NVIDIA RTX 3090 Ti GPUs."
Software Dependencies — Yes
"The environments of CREW are implemented using Unity 2021.3.24f1, with packages ML Agents 2.3.0-exp.3 (Juliani et al., 2018), Netcode for Game Objects 1.3, and Nakama Unity 3.6.0. Algorithms are developed with torchrl 0.3.0 (Bou et al., 2023)."
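The Python-side dependency quoted above can be pinned directly; a minimal sketch (the Unity-side packages — Unity 2021.3.24f1, ML Agents 2.3.0-exp.3, Netcode for Game Objects 1.3, Nakama Unity 3.6.0 — are managed through the Unity package manager rather than pip):

```shell
# Pin the documented torchrl version used for the algorithms.
pip install torchrl==0.3.0
```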
Experiment Setup — Yes
"The hyperparameter settings for our experiments are summarized in Table 4."

Table 4: Hyperparameters

Hyperparameter             c-Deep TAMER                       DDPG         SAC
γ                          0.99                               0.99         0.99
learning rate              1e-4                               1e-4         1e-4
max_grad_norm              0.1                                0.1          0.1
batch size                 16                                 240          240
frames per batch           8                                  240          240
alpha_init                 -                                  -            0.1
target entropy             -                                  -            -6.0
actor scale_lb             -                                  -            1e-4
# Q value nets             -                                  2            2
target update polyak       0.995                              0.995        0.995
actor exploration noise    N(0, 0.1)                          N(0, 0.1)    -
credit assignment window   bowling [0.2, 4], others [0.2, 1]  -            -
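The credit-assignment window in Table 4 (e.g. [0.2, 1] seconds for non-bowling tasks) can be illustrated with a minimal sketch of Deep TAMER-style uniform credit assignment: each human feedback signal is credited to the recent agent steps that fall inside the window. The function name and signature below are hypothetical, not CREW's actual implementation.

```python
import torch

def credit_weights(feedback_time, step_times, window=(0.2, 1.0)):
    """Uniform credit-assignment weights for one human feedback signal.

    A step taken at time t is credited if the feedback arrived between
    window[0] and window[1] seconds after it; credited steps share the
    feedback equally (weights normalized to sum to 1).
    """
    t = torch.as_tensor(step_times, dtype=torch.float32)
    delay = feedback_time - t                       # feedback lag per step
    mask = (delay >= window[0]) & (delay <= window[1])
    w = mask.float()
    total = w.sum()
    return w / total if total > 0 else w

# Example: agent steps every 0.1 s, human feedback arrives at t = 2.05 s.
steps = [i * 0.1 for i in range(21)]                # 0.0 .. 2.0 s
w = credit_weights(2.05, steps, window=(0.2, 1.0))
# Steps at 1.1 .. 1.8 s fall inside the window and share the credit.
```

With the bowling window [0.2, 4] from Table 4, a much longer span of steps would be credited, reflecting the delayed outcome of a bowling roll.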