Forecasting Human Trajectory from Scene History
Authors: Mancheng Meng, Ziyan Wu, Terrence Chen, Xiran Cai, Xiang Zhou, Fan Yang, Dinggang Shen
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive evaluations to validate the efficacy of our proposed framework on ETH, UCY, as well as a new, challenging benchmark dataset PAV, demonstrating superior performance compared to state-of-the-art methods. Code is available at: https://github.com/MaKaRuiNah/SHENet |
| Researcher Affiliation | Collaboration | 1 ShanghaiTech University, 2 United Imaging Intelligence; {mengmch,caixr,dgshen}@shanghaitech.edu.cn, {ziyan.wu,terrence.chen,sean.zhou,fan.yang03}@uii-ai.com |
| Pseudocode | Yes | Algorithm 1 Group trajectory bank constructing |
| Open Source Code | Yes | Code is available at: https://github.com/MaKaRuiNah/SHENet |
| Open Datasets | Yes | We evaluate our method on ETH [24], UCY [13], PAV, and Stanford Drone Dataset (SDD) [26] datasets. ... Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] All data used in the paper is public; our code is provided |
| Dataset Splits | No | We divide the videos into training (80%) and testing (20%) sets. ... No explicit mention of a separate validation set split or how it was used beyond the training/testing division. |
| Hardware Specification | Yes | We train the model for 100 epochs on 4 NVIDIA Quadro RTX 6000 GPUs and use the Adam optimizer with a fixed learning rate of 1e-5. |
| Software Dependencies | No | The paper mentions software like "Swin Transformer" and "Adam optimizer" but does not specify their version numbers or other crucial software dependencies with versions for reproducibility. |
| Experiment Setup | Yes | In SHENet, the initial size of the group trajectory bank is set to \|Zbank\| = 32. Both the trajectory encoder and the scene encoder have 4 self-attention (SA) layers. The cross-modal transformer has 6 SA and cross-attention (CA) layers. We set all the embed dimensions to 512. The trajectory encoder learns the human motion information with size Tpas × 512 (Tpas = 8 in ETH/UCY, Tpas = 10 in PAV). The scene encoder outputs semantic features of size 150 × 56 × 56. We reshape the features from 150 × 56 × 56 to 150 × 3136, and project them from 150 × 3136 to 150 × 512. We train the model for 100 epochs on 4 NVIDIA Quadro RTX 6000 GPUs and use the Adam optimizer with a fixed learning rate of 1e-5. |
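
The configuration quoted in the Experiment Setup row maps onto the following minimal PyTorch sketch. The module names, attention-head count, and the use of `nn.TransformerEncoder`/`nn.TransformerDecoder` as stand-ins for the paper's SA and cross-modal layers are assumptions; only the dimensions, bank size, epoch count, and learning rate come from the paper.

```python
# Hypothetical sketch of the SHENet setup described above; structure is assumed,
# only the quoted dimensions and hyperparameters come from the paper.
import torch
import torch.nn as nn

EMBED_DIM = 512        # "We set all the embed dimensions to 512."
BANK_SIZE = 32         # initial |Zbank|
T_PAS_ETH_UCY = 8      # past steps on ETH/UCY (10 on PAV)
SEM_CLASSES = 150      # scene semantic features: 150 x 56 x 56
SEM_HW = 56 * 56       # flattened spatial dimension: 3136

def sa_stack(num_layers: int) -> nn.TransformerEncoder:
    """Plain self-attention stack standing in for the paper's SA layers (assumed)."""
    layer = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

class SHENetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.traj_embed = nn.Linear(2, EMBED_DIM)        # (x, y) -> 512 (assumed embedding)
        self.traj_encoder = sa_stack(4)                  # trajectory encoder: 4 SA layers
        self.scene_proj = nn.Linear(SEM_HW, EMBED_DIM)   # 150 x 3136 -> 150 x 512
        self.scene_encoder = sa_stack(4)                 # scene encoder: 4 SA layers
        # 6-layer cross-modal transformer, approximated here with nn.TransformerDecoder,
        # whose layers interleave self-attention and cross-attention.
        dec = nn.TransformerDecoderLayer(d_model=EMBED_DIM, nhead=8, batch_first=True)
        self.cross_modal = nn.TransformerDecoder(dec, num_layers=6)

    def forward(self, past_traj, scene_sem):
        # past_traj: (B, T_pas, 2); scene_sem: (B, 150, 56, 56)
        traj_feat = self.traj_encoder(self.traj_embed(past_traj))   # (B, T_pas, 512)
        scene_tokens = self.scene_proj(scene_sem.flatten(2))        # (B, 150, 512)
        scene_feat = self.scene_encoder(scene_tokens)
        return self.cross_modal(traj_feat, scene_feat)              # fused features

model = SHENetSketch()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)           # fixed LR from the paper
out = model(torch.randn(2, T_PAS_ETH_UCY, 2), torch.randn(2, SEM_CLASSES, 56, 56))
print(out.shape)  # torch.Size([2, 8, 512])
```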
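
The Pseudocode row points to Algorithm 1 ("Group trajectory bank constructing"), whose steps are not quoted in the table. As a rough illustration only, the sketch below builds a bank of 32 group representatives by k-means clustering over training trajectories; the clustering choice and the helper name are assumptions, not the paper's actual procedure.

```python
# Minimal, assumed sketch: cluster training trajectories into |Zbank| = 32 groups.
import numpy as np
from sklearn.cluster import KMeans

def build_trajectory_bank(trajectories: np.ndarray, bank_size: int = 32) -> np.ndarray:
    """trajectories: (N, T, 2) array of observed (x, y) tracks from the training set."""
    n, t, d = trajectories.shape
    flat = trajectories.reshape(n, t * d)                     # one feature vector per track
    kmeans = KMeans(n_clusters=bank_size, n_init=10, random_state=0).fit(flat)
    return kmeans.cluster_centers_.reshape(bank_size, t, d)   # group trajectories

# Example with synthetic tracks of length T_pas = 8 (the ETH/UCY setting)
bank = build_trajectory_bank(np.random.rand(1000, 8, 2))
print(bank.shape)  # (32, 8, 2)
```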