Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting
Authors: Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. |
| Researcher Affiliation | Academia | (1) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; (2) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen, China; (3) Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China |
| Pseudocode | No | The paper describes the model architecture and process in detail but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | We introduce 4 datasets used in multi-view people detection, including CVCS (Zhang, Lin, and Chan 2021), CityStreet (Zhang and Chan 2019), Wildtrack (Chavdarova et al. 2018) and MultiviewX (Hou, Zheng, and Gould 2020), among which the latter 2 datasets are relatively smaller in scene size (see dataset comparison in Table 1). |
| Dataset Splits | Yes | CVCS is a synthetic multi-view people dataset containing 31 scenes, of which 23 are for training and the remaining 8 for testing... The ground plane map resolution is 900 × 800, where each grid cell stands for 0.1 meter in the real world. In training, 5 views are randomly selected 5 times per frame of each scene in each iteration, and the same number of views is randomly selected 21 times in evaluation (a minimal sketch of this sampling protocol follows the table). |
| Hardware Specification | No | The paper states, 'The proposed model is based on ResNet/VGG backbone,' but does not provide specific hardware details such as GPU/CPU models or memory specifications used for the experiments. |
| Software Dependencies | No | The paper mentions using ResNet/VGG backbones but does not specify any software names with version numbers for libraries, frameworks, or other dependencies. |
| Experiment Setup | Yes | For the view-wise contribution weighted fusion, the single-view predictions are fed into a 4-layer subnet: [3×3×1×256, 3×3×256×256, 3×3×256×128, 3×3×128×1] (a sketch of this subnet follows the table). The map classification threshold is 0.4 for all datasets, and the distance threshold is 1 m (5 pixels) on CVCS, 2 m (20 pixels) on CityStreet, and 0.5 m (5 pixels) on MultiviewX and Wildtrack. For model training, a 3-stage schedule is used: first, the 2D counting task is trained as pretraining for the feature extraction subnet; then, the projected single-view decoding subnet is trained after loading the pre-trained feature extraction subnet; finally, the projected single-view decoding subnet and the multi-view decoding subnet are trained together, with loss term weight λ = 1. Other training settings follow MVDet. |
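The view-sampling protocol quoted in the Dataset Splits row is simple enough to illustrate directly. The sketch below mirrors the quoted 5-views / 5-draws training setup and 21-draws evaluation setup; the helper name `sample_view_subsets` and the 60-camera scene size are illustrative assumptions, not details from the paper.

```python
import random

def sample_view_subsets(camera_ids, n_views=5, n_draws=5):
    # Draw `n_views` distinct cameras, `n_draws` times, mirroring the
    # quoted CVCS protocol: 5 random views drawn 5 times per frame in
    # training, and 21 times in evaluation.
    return [random.sample(camera_ids, n_views) for _ in range(n_draws)]

# Hypothetical scene with 60 cameras (CVCS scenes vary in camera count).
cameras = list(range(60))
train_subsets = sample_view_subsets(cameras, n_views=5, n_draws=5)
eval_subsets = sample_view_subsets(cameras, n_views=5, n_draws=21)
```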
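For a concrete picture of the fusion subnet described in the Experiment Setup row, here is a minimal PyTorch sketch. The layer widths follow the quoted [3×3×1×256, 3×3×256×256, 3×3×256×128, 3×3×128×1] specification; the class name `ViewWeightSubnet`, the ReLU activations, unit padding, sigmoid output, and the weighted-sum fusion in the usage example are assumptions not stated in the excerpt.

```python
import torch
import torch.nn as nn

class ViewWeightSubnet(nn.Module):
    """Sketch of the 4-layer weighting subnet from the quoted setup:
    3x3x1x256 -> 3x3x256x256 -> 3x3x256x128 -> 3x3x128x1.
    ReLU activations, unit padding, and the sigmoid on the output
    are assumptions; the paper excerpt does not specify them."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=3, padding=1),
        )

    def forward(self, pred_map):
        # pred_map: (B, 1, H, W) projected single-view prediction map.
        return torch.sigmoid(self.layers(pred_map))


if __name__ == "__main__":
    subnet = ViewWeightSubnet()
    # Illustrative weighted fusion over V = 5 views (assumed formulation):
    preds = torch.rand(2, 5, 1, 90, 80)  # (B, V, 1, H, W)
    weights = torch.stack([subnet(preds[:, v]) for v in range(5)], dim=1)
    fused = (weights * preds).sum(dim=1)  # (B, 1, H, W)
    print(fused.shape)
```

The weighted sum across views is one plausible reading of "view-wise contribution weighted fusion"; the paper itself should be consulted for the exact fusion formulation.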