Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting
Authors: Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. |
| Researcher Affiliation | Academia | (1) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; (2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; (3) Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China |
| Pseudocode | No | The paper describes the model architecture and process in detail but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We introduce 4 datasets used in multi-view people detection, including CVCS (Zhang, Lin, and Chan 2021), CityStreet (Zhang and Chan 2019), Wildtrack (Chavdarova et al. 2018) and MultiviewX (Hou, Zheng, and Gould 2020), among which the latter 2 datasets are relatively smaller in scene size (see dataset comparison in Table 1). |
| Dataset Splits | Yes | CVCS is a synthetic multi-view people dataset containing 31 scenes, of which 23 are for training and the remaining 8 for testing... The ground plane map resolution is 900 × 800, where each grid cell stands for 0.1 meter in the real world. In training, 5 views are randomly selected 5 times per frame of each scene in each iteration, and the same number of views is randomly selected 21 times in evaluation (a minimal sketch of this sampling protocol follows the table). |
| Hardware Specification | No | The paper states, 'The proposed model is based on ResNet/VGG backbone,' but does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using ResNet/VGG backbones but does not specify any software names with version numbers for libraries, frameworks, or other dependencies. |
| Experiment Setup | Yes | For the view-wise contribution weighted fusion, the single-view predictions are fed into a 4-layer subnet: [3×3×1×256, 3×3×256×256, 3×3×256×128, 3×3×128×1] (a sketch of this subnet follows the table). The map classification threshold is 0.4 for all datasets, and the distance threshold is 1 m (5 pixels) on CVCS, 2 m (20 pixels) on CityStreet, and 0.5 m (5 pixels) on MultiviewX and Wildtrack. For model training, a 3-stage schedule is used: first, the 2D counting task is trained as pretraining for the feature extraction subnet; then, the projected single-view decoding subnet is trained after loading the pre-trained feature extraction subnet; finally, the projected single-view decoding subnet and the multi-view decoding subnet are trained together, with loss term weight λ = 1. Other training settings follow MVDet. |
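The view-sampling protocol quoted in the Dataset Splits row is simple enough to illustrate directly. The sketch below mirrors the quoted 5-views / 5-draws training setup and 21-draws evaluation setup; the helper name `sample_view_subsets` and the 60-camera scene size are illustrative assumptions, not details from the paper.

```python
import random

def sample_view_subsets(camera_ids, n_views=5, n_draws=5):
    # Draw `n_views` distinct cameras, `n_draws` times, mirroring the
    # quoted CVCS protocol: 5 random views drawn 5 times per frame in
    # training, and 21 times in evaluation.
    return [random.sample(camera_ids, n_views) for _ in range(n_draws)]

# Hypothetical scene with 60 cameras (CVCS scenes vary in camera count).
cameras = list(range(60))
train_subsets = sample_view_subsets(cameras, n_views=5, n_draws=5)
eval_subsets = sample_view_subsets(cameras, n_views=5, n_draws=21)
```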
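For a concrete picture of the fusion subnet described in the Experiment Setup row, here is a minimal PyTorch sketch. The layer widths follow the quoted [3×3×1×256, 3×3×256×256, 3×3×256×128, 3×3×128×1] specification; the class name `ViewWeightSubnet`, the ReLU activations, unit padding, sigmoid output, and the weighted-sum fusion in the usage example are assumptions not stated in the excerpt.

```python
import torch
import torch.nn as nn

class ViewWeightSubnet(nn.Module):
    """Sketch of the 4-layer weighting subnet from the quoted setup:
    3x3x1x256 -> 3x3x256x256 -> 3x3x256x128 -> 3x3x128x1.
    ReLU activations, unit padding, and the sigmoid on the output
    are assumptions; the paper excerpt does not specify them."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, pred_map):
        # pred_map: (B, 1, H, W) projected single-view prediction map.
        return torch.sigmoid(self.layers(pred_map))


if __name__ == "__main__":
    subnet = ViewWeightSubnet()
    # Illustrative weighted fusion over V = 5 views (assumed formulation):
    preds = torch.rand(2, 5, 1, 90, 80)  # (B, V, 1, H, W)
    weights = torch.stack([subnet(preds[:, v]) for v in range(5)], dim=1)
    fused = (weights * preds).sum(dim=1)  # (B, 1, H, W)
    print(fused.shape)
```

The weighted sum across views is one plausible reading of "view-wise contribution weighted fusion"; the paper itself should be consulted for the exact fusion formulation.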