Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our results on the PoseTrack [22] dataset. We demonstrate the effectiveness of our approach on three applications: 1) video pose propagation, 2) training a network on annotations augmented with propagated pose pseudo-labels, 3) temporal pose aggregation during inference.
Researcher Affiliation | Collaboration | Gedas Bertasius (1,2), Christoph Feichtenhofer (1), Du Tran (1), Jianbo Shi (2), Lorenzo Torresani (1). 1: Facebook AI, 2: University of Pennsylvania.
Pseudocode | No | The paper includes architectural diagrams (Figure 1 and Figure 2) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code has been made available at: https://github.com/facebookresearch/PoseWarper
Open Datasets | Yes | Our trained PoseWarper can then be used for several applications. [...] and leads to state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets [22].
Dataset Splits | Yes | We train our PoseWarper on sparsely labeled videos from the training set of PoseTrack2017 [22] and then perform our evaluations on the validation set.
Hardware Specification | Yes | The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs.
Software Dependencies | No | The paper mentions using the Adam optimizer and an HRNet-W48 backbone, but does not provide version numbers for any software dependencies, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | Implementation Details. Following the framework in [27], for training, we crop a 384×288 bounding box around each person and use it as input to our model. During training, we use ground-truth person bounding boxes. We also employ random rotations, scaling, and horizontal flipping to augment the data. To learn the network, we use the Adam optimizer [54] with a base learning rate of 10^-4, which is reduced to 10^-5 and 10^-6 after 10 and 15 epochs, respectively. The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs. We initialize our model with an HRNet-W48 [27] pretrained for the COCO keypoint estimation task. To train the deformable warping module, we select Frame B with a random time-gap δ ∈ [-3, 3] relative to Frame A. To compute features relating the two frames, we use twenty 3×3 residual blocks, each with 128 channels. To compute the offsets o^(d), we use five 3×3 convolutional layers, each using a different dilation rate (d = 3, 6, 12, 18, 24). To resample the pose heatmap f_B, we employ five 3×3 deformable convolutional layers, each applied to one of the five predicted offset maps o^(d). The five deformable convolution layers likewise use dilation rates of 3, 6, 12, 18, 24. (Illustrative code sketches of this setup follow below.)
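
The optimizer and learning-rate schedule quoted in the Experiment Setup row map directly onto standard PyTorch utilities. Below is a minimal sketch assuming a PyTorch implementation (the released code may structure this differently); `model` and `train_one_epoch` are hypothetical placeholders.

```python
# Minimal sketch of the quoted schedule: Adam at 1e-4, reduced 10x
# after epochs 10 and 15, training terminated after 20 epochs.
# `model` and `train_one_epoch` are hypothetical placeholders.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 15], gamma=0.1
)

for epoch in range(20):
    train_one_epoch(model, optimizer)  # one pass over the sparsely labeled training set
    scheduler.step()                   # lr: 1e-4 -> 1e-5 (epoch 10) -> 1e-6 (epoch 15)
```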
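
The multi-dilation offset prediction and deformable resampling described in the same row can likewise be sketched with torchvision's DeformConv2d. The module below is an assumption-laden illustration, not the authors' released code: the class and argument names are invented, the 128-channel stack of twenty residual blocks is omitted, the number of joints is a COCO-style guess, and summing the five rewarped heatmaps is an assumed aggregation step.

```python
# Illustrative sketch of the deformable warping module described in the
# quoted setup. Names are invented; torchvision's DeformConv2d stands in
# for the deformable convolution used by the authors.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

DILATIONS = (3, 6, 12, 18, 24)  # dilation rates quoted in the paper


class WarpingModuleSketch(nn.Module):
    def __init__(self, feat_channels=128, num_joints=17):
        # num_joints=17 is an assumption; the true value depends on the
        # dataset's keypoint definition.
        super().__init__()
        # Five 3x3 offset-prediction convs, one per dilation rate. Each
        # outputs 2 * 3 * 3 = 18 channels, the per-location (dy, dx)
        # offsets DeformConv2d expects for a 3x3 kernel.
        self.offset_convs = nn.ModuleList(
            [nn.Conv2d(feat_channels, 18, kernel_size=3, padding=d, dilation=d)
             for d in DILATIONS]
        )
        # Five 3x3 deformable convs over the heatmap f_B, with matching
        # dilation rates; padding=d preserves the spatial size.
        self.deform_convs = nn.ModuleList(
            [DeformConv2d(num_joints, num_joints, kernel_size=3,
                          padding=d, dilation=d)
             for d in DILATIONS]
        )

    def forward(self, relating_feats, heatmap_b):
        # relating_feats: 128-channel features relating Frames A and B
        # (the paper computes them with twenty 3x3 residual blocks,
        # omitted here). heatmap_b: pose heatmap f_B of Frame B.
        warped = [dcn(heatmap_b, off(relating_feats))
                  for off, dcn in zip(self.offset_convs, self.deform_convs)]
        # How the five rewarped heatmaps are fused is not specified in
        # the quoted text; summation here is an assumption.
        return torch.stack(warped).sum(dim=0)
```

A forward pass takes the features relating the two frames plus Frame B's heatmap and returns a heatmap rewarped toward Frame A, which is how the paper describes propagating pose information across sparsely labeled frames.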