Navigation Turing Test (NTT): Learning to Evaluate Human-Like Navigation
Authors: Sam Devlin, Raluca Georgescu, Ida Momennejad, Jaroslaw Rzepecki, Evelyn Zuniga, Gavin Costello, Guy Leroy, Ali Shaw, Katja Hofmann
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our automated NTT on a navigation task in a complex 3D environment. We investigate six classification models to shed light on the types of architectures best suited to this task, and validate them against data collected through a human NTT. Our best models achieve high accuracy when distinguishing true human and agent behavior. |
| Researcher Affiliation | Industry | Microsoft Research, Cambridge, UK; Microsoft Research, New York, NY, USA; Ninja Theory, Cambridge, UK. |
| Pseudocode | No | The paper includes figures illustrating model architectures but does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | CNN. Convolutional models are applied to image input (visual, top-down and bar-code observations). We use a VGG-16 (Simonyan & Zisserman, 2014) pre-trained on Imagenet (Deng et al., 2009) to extract visual features. (A feature-extraction sketch follows the table.) |
| Dataset Splits | Yes | A total of 140 human recordings were collected, which we split into 100 videos (from 4 players) for classifier training and validation, and 40 (3 remaining players) for testing. Training and hyperparameter tuning was performed using 5-fold cross validation on trajectories generated by agent checkpoints and human players that were fully separate from those that generated test data. (A data-split sketch follows the table.) |
| Hardware Specification | No | The paper mentions “recorded replays on machines that met the system requirements of the experimental game build, including GPU rendering support” and agent training on “60 parallel game instances”, but does not provide specific hardware details such as GPU/CPU models or memory. |
| Software Dependencies | No | The paper mentions “Tensorflow (Abadi et al., 2015)”, “PPO (Schulman et al., 2017)”, and “VGG-16 (Simonyan & Zisserman, 2014)”, but does not specify exact version numbers for the software libraries or models used in the experiments. |
| Experiment Setup | Yes | The agents were trained using PPO (Schulman et al., 2017)... The reward signal during training consists of a dense reward for minimizing the distance..., a +1 reward for reaching the target, and a -1 penalty for dying... a small per-step penalty of 0.01 encourages efficient task completion. Episodes end when agents reach the goal radius or after 3,000 game ticks... Training and hyperparameter tuning was performed using 5-fold cross validation... See Appendix A.1 for training details and hyperparameters. (A reward sketch follows the table.) |
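
The "Open Datasets" row quotes the paper's use of a VGG-16 pre-trained on ImageNet as a frozen feature extractor for its image-based classifiers. The sketch below illustrates that idea with `tf.keras`; the specific API calls, pooling choice, and 224x224 input size are our assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): per-frame features from an
# ImageNet-pretrained VGG-16, as quoted in the CNN description above.
import numpy as np
import tensorflow as tf

# VGG-16 without its classification head; global average pooling gives one
# 512-dimensional feature vector per frame.
backbone = tf.keras.applications.VGG16(
    include_top=False, weights="imagenet", pooling="avg"
)
backbone.trainable = False  # frozen feature extractor

def extract_features(frames: np.ndarray) -> np.ndarray:
    """frames: (batch, 224, 224, 3) observations (visual, top-down, or bar-code)."""
    x = tf.keras.applications.vgg16.preprocess_input(frames.astype(np.float32))
    return backbone(x, training=False).numpy()  # shape (batch, 512)
```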
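
The "Dataset Splits" row reports 100 human trajectories (4 players) for classifier training and validation with 5-fold cross-validation, and 40 trajectories (3 held-out players) for testing. The sketch below shows that protocol with scikit-learn, using placeholder data in place of the real recordings; the variable names and loading step are assumptions.

```python
# Minimal sketch of the reported split: 140 recordings -> 100 for training/validation
# (4 players, tuned with 5-fold cross-validation) and 40 for testing (3 held-out players).
# Placeholder indices stand in for real trajectories; this is not the authors' code.
from sklearn.model_selection import KFold

human_trajectories = list(range(140))      # placeholder for the 140 recorded trajectories
train_val = human_trajectories[:100]       # 4 players: classifier training and tuning
held_out_test = human_trajectories[100:]   # 3 remaining players: final evaluation only

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_val)):
    train_split = [train_val[i] for i in train_idx]
    val_split = [train_val[i] for i in val_idx]
    # Fit the classifier on train_split and tune hyperparameters on val_split;
    # held_out_test is used only once, after model selection.
```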
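
The "Experiment Setup" row quotes the agent-training reward: a dense term for reducing distance to the target, +1 for reaching it, -1 for dying, a 0.01 per-step penalty, and episodes that end at the goal radius or after 3,000 game ticks. The sketch below assembles those pieces; the goal-radius value and the function signatures are assumptions.

```python
# Hedged sketch of the quoted reward composition; only the per-step penalty and
# tick limit are taken from the paper, everything else is an assumption.
GOAL_RADIUS = 1.0      # assumed value; the paper refers to a goal radius without quoting a number here
MAX_TICKS = 3000       # episode limit quoted in the paper
STEP_PENALTY = 0.01    # per-step penalty quoted in the paper

def step_reward(prev_dist: float, dist: float, reached_goal: bool, died: bool) -> float:
    """Reward for one environment step."""
    reward = prev_dist - dist      # dense shaping: positive when the agent moves closer
    reward -= STEP_PENALTY         # encourages efficient task completion
    if reached_goal:
        reward += 1.0              # bonus for reaching the target
    if died:
        reward -= 1.0              # penalty for dying
    return reward

def episode_done(dist: float, tick: int) -> bool:
    """Episodes end on reaching the goal radius or after 3,000 game ticks."""
    return dist <= GOAL_RADIUS or tick >= MAX_TICKS
```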