Multi-scale Spatial Representation Learning via Recursive Hermite Polynomial Networks

Authors: Lin (Yuanbo) Wu, Deyin Liu, Xiaojie Guo, Richang Hong, Liangchen Liu, Rui Zhang

IJCAI 2022

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. "Extensive experiments are conducted to demonstrate the efficacy of our design, and reveal its superiority over state-of-the-art alternatives on a variety of image recognition tasks. Besides, introspective studies are provided to further understand the properties of our method."
Researcher Affiliation: Academia. 1) Key Laboratory of Knowledge Engineering with Big Data, Ministry of Education; School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China. 2) Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Artificial Intelligence, Anhui University, Hefei 230039, China. 3) Tianjin University, China. 4) The University of Melbourne, Victoria 3052, Australia. 5) www.ruizhang.info
Pseudocode: No. The paper describes the proposed method using mathematical equations and diagrams (e.g., Figure 1) but does not include a dedicated pseudocode or algorithm block.
Open Source Code: No. The paper does not provide an explicit statement or link to open-source code for the RHP-Nets methodology it describes.
Open Datasets: Yes. "We use the Caltech-UCSD Birds-200-2011 (CUB) [Wah et al., 2010] as the benchmark dataset. CUB consists of 200 classes with 5,994 training and 5,794 testing images. We use the Cityscapes dataset [Cordts et al., 2016], which contains 5,000 images recorded from street scenes in 50 different cities. The dataset is annotated with 30 categories, and 19 categories are used for training and evaluation. The training, validation and test sets contain 2,975, 500, and 1,525 images, respectively. The MS COCO dataset [Lin et al., 2014] has 80 categories and contains 115k images for training (train2017), 5k images for validation (val2017), and 20k images for testing (test-dev). PRW [Zheng et al., 2017] is a person search benchmark. The dataset contains a total of 11,816 video frames and 43,100 person bounding boxes. The training set has 482 different identities from 5,704 raw video frames, and the testing set has 2,057 probe IDs along with a gallery repository of 6,122 images."
Dataset Splits: Yes. "The training, validation and test sets contain 2,975, 500, and 1,525 images, respectively." (Cityscapes)
Hardware Specification: No. The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies: No. The paper mentions that "All experiments are implemented based on mmdetection [Chen et al., 2018a]" but does not provide specific version numbers for mmdetection or any other software dependencies.
Experiment Setup: Yes. "The HPB is plugged into the network, e.g., ResNet [He et al., 2016], from level 4 to level 7, nested into the 3×3 convolution, where the dilation rates are set to d = {2, 4, 2, 1} from L4 to L7, respectively. In this way, we achieve spatial accuracy across convolutions to benefit dense prediction. To produce sub-scale granular features without artifacts, we realize the spatial frequency inside each layer by compositing Hermite polynomials into the dilated grids. This leads to the proposed RHP-Nets, which simultaneously achieve multi-scale representations with high spatial variance and sub-scale features for boosting dense prediction. The activations on each level are shown in Fig. 3. Note that batch normalization is applied before the Hermite polynomial transformation. We tune the number of Hermite polynomials N over N ∈ {0, 2, 4, 6, 8} in our ablation study. In all experiments, we set N = 4 as default. ... We consider a foreground saliency mask predictor [Liu et al., 2010], where each pixel is trained with the binary cross-entropy loss against the target mask. The mask is built by labeling pixels inside the ground-truth boxes as foreground. ... Data augmentation is performed with random horizontal flipping, cropping and scaling. ... The learning rate is initialized to 0.001, and a warm-up schedule is applied, with the learning rate reduced as training progresses. A random region of each image is erased to reduce over-fitting."
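The "Hermite polynomial transformation" referenced in the setup can be grounded by the standard three-term recurrence for physicists' Hermite polynomials, which is presumably the "recursive" computation that gives RHP-Nets their name. The sketch below is a minimal NumPy illustration of that recurrence only; the function name is an assumption, and the paper does not specify how the resulting basis is composited into the dilated convolution grids:

```python
import numpy as np

def hermite_basis(x, n_max):
    """Evaluate physicists' Hermite polynomials H_0..H_{n_max} at x using
    the three-term recurrence H_{k+1}(x) = 2x*H_k(x) - 2k*H_{k-1}(x).

    x: ndarray of evaluation points (e.g., normalized feature responses).
    Returns an array of shape (n_max + 1, *x.shape).
    """
    basis = [np.ones_like(x)]          # H_0(x) = 1
    if n_max >= 1:
        basis.append(2.0 * x)          # H_1(x) = 2x
    for k in range(1, n_max):
        # Recursive step: each polynomial built from the previous two.
        basis.append(2.0 * x * basis[k] - 2.0 * k * basis[k - 1])
    return np.stack(basis)

# With the paper's default N = 4, each input value is expanded into
# five polynomial responses H_0..H_4.
responses = hermite_basis(np.linspace(-1.0, 1.0, 8), 4)
```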
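The foreground target mask described in the setup (pixels inside ground-truth boxes labeled as foreground) is straightforward to reconstruct. The sketch below assumes axis-aligned boxes in (x1, y1, x2, y2) pixel coordinates; the function name and box format are illustrative, as the paper does not specify them:

```python
import numpy as np

def boxes_to_mask(boxes, height, width):
    """Build a binary foreground mask from ground-truth boxes.

    Pixels inside any box are set to 1 (foreground); all others stay 0.
    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        # Rows index y, columns index x; half-open slices exclude x2/y2.
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask
```

Each pixel of this mask would then serve as the target for a per-pixel binary cross-entropy loss, as the paper describes for its saliency mask predictor.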