Forecasting Human Trajectory from Scene History

Authors: Mancheng Meng, Ziyan Wu, Terrence Chen, Xiran Cai, Xiang Zhou, Fan Yang, Dinggang Shen

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive evaluations to validate the efficacy of our proposed framework on ETH, UCY, as well as a new, challenging benchmark dataset PAV, demonstrating superior performance compared to state-of-the-art methods. Code is available at: https://github.com/MaKaRuiNah/SHENet
Researcher Affiliation | Collaboration | 1ShanghaiTech University, 2United Imaging Intelligence; {mengmch,caixr,dgshen}@shanghaitech.edu.cn, {ziyan.wu,terrence.chen,sean.zhou,fan.yang03}@uii-ai.com
Pseudocode | Yes | Algorithm 1: Group trajectory bank construction
Open Source Code | Yes | Code is available at: https://github.com/MaKaRuiNah/SHENet
Open Datasets | Yes | We evaluate our method on ETH [24], UCY [13], PAV and Stanford Drone Dataset (SDD) [26] datasets. ... Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All data used in the paper is public; our code is provided.
Dataset Splits | No | We divide the videos into training (80%) and testing (20%) sets. ... There is no explicit mention of a separate validation split or of how one was used beyond the training/testing division.
Hardware Specification | Yes | We train the model for 100 epochs on 4 NVIDIA Quadro RTX 6000 GPUs and use the Adam optimizer with a fixed learning rate 1e-5.
Software Dependencies | No | The paper mentions software such as the Swin Transformer and the Adam optimizer, but does not specify version numbers or other crucial software dependencies needed for reproducibility.
Experiment Setup | Yes | In SHENet, the initial size of the group trajectory bank is set to |Zbank| = 32. Both the trajectory encoder and the scene encoder have 4 self-attention (SA) layers. The cross-modal transformer has 6 SA and cross-attention (CA) layers. We set all the embedding dimensions to 512. The trajectory encoder learns the human motion information with size Tpas × 512 (Tpas = 8 in ETH/UCY, Tpas = 10 in PAV). The scene encoder outputs semantic features of size 150 × 56 × 56. We reshape the features from 150 × 56 × 56 to 150 × 3136, and project them from 150 × 3136 to 150 × 512. We train the model for 100 epochs on 4 NVIDIA Quadro RTX 6000 GPUs and use the Adam optimizer with a fixed learning rate 1e-5.
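
The Pseudocode row above refers to the paper's Algorithm 1 for constructing the group trajectory bank, whose initial size is |Zbank| = 32 per the Experiment Setup row. The sketch below only illustrates the general idea of building such a bank by clustering full scene trajectories and keeping one representative per cluster; the use of k-means, the cluster-mean representative, and the function name build_group_trajectory_bank are assumptions for illustration, not the paper's exact Algorithm 1.

```python
# Minimal sketch of a group trajectory bank built by clustering
# scene-history trajectories. K-means and the cluster-mean
# representative are illustrative assumptions, not the paper's
# exact Algorithm 1.
import numpy as np
from sklearn.cluster import KMeans

def build_group_trajectory_bank(trajectories, bank_size=32):
    """trajectories: array of shape (N, T, 2), N full trajectories of
    length T in image coordinates. Returns a bank of shape (bank_size, T, 2)."""
    n, t, d = trajectories.shape
    flat = trajectories.reshape(n, t * d)          # flatten (x, y) sequences
    km = KMeans(n_clusters=bank_size, n_init=10, random_state=0).fit(flat)
    # One representative ("group") trajectory per cluster: the cluster mean.
    bank = np.stack([flat[km.labels_ == c].mean(axis=0) for c in range(bank_size)])
    return bank.reshape(bank_size, t, d)

# Usage (hypothetical): bank = build_group_trajectory_bank(train_trajs, bank_size=32)
```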
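The Dataset Splits row quotes an 80%/20% division of the videos into training and testing sets, with no validation split described. A minimal per-video split could look like the sketch below; the random seed and the split_videos helper are hypothetical.

```python
# Illustrative 80/20 split of videos into training and testing sets.
# The shuffling seed is an assumption; the paper describes no validation split.
import random

def split_videos(video_ids, train_ratio=0.8, seed=0):
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]   # (train_videos, test_videos)
```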
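The Experiment Setup row lists concrete dimensions (embedding size 512, Tpas = 8 past steps on ETH/UCY, semantic features 150 × 56 × 56 flattened to 150 × 3136 and projected to 150 × 512) together with the optimizer settings. The sketch below wires up stand-in PyTorch modules only to make those shapes and settings concrete; the module choices (nn.TransformerEncoder for the two encoders, nn.Transformer for the cross-modal transformer) and the head count are assumptions, not the released SHENet code.

```python
# Shape/optimizer sketch of the reported SHENet configuration,
# assuming a PyTorch implementation with stand-in modules.
import torch
import torch.nn as nn

EMBED = 512
T_PAS = 8          # 8 past steps on ETH/UCY, 10 on PAV

def sa_encoder(num_layers):
    layer = nn.TransformerEncoderLayer(d_model=EMBED, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

traj_encoder  = sa_encoder(4)                      # 4 self-attention layers
scene_encoder = sa_encoder(4)                      # 4 self-attention layers
cross_modal   = nn.Transformer(d_model=EMBED, nhead=8,
                               num_encoder_layers=6, num_decoder_layers=6,
                               batch_first=True)   # stand-in for 6 SA + CA layers
scene_proj    = nn.Linear(56 * 56, EMBED)          # 150x3136 -> 150x512

traj_feat  = traj_encoder(torch.randn(1, T_PAS, EMBED))   # (1, 8, 512)
sem        = torch.randn(1, 150, 56, 56).flatten(2)       # (1, 150, 3136)
scene_feat = scene_encoder(scene_proj(sem))                # (1, 150, 512)
fused      = cross_modal(scene_feat, traj_feat)            # (1, 8, 512)

params = (list(traj_encoder.parameters()) + list(scene_encoder.parameters())
          + list(cross_modal.parameters()) + list(scene_proj.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-5)      # fixed LR; 100 epochs reported
```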