Learning Temporal Pose Estimation from Sparsely-Labeled Videos
Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present our results on the PoseTrack [22] dataset. We demonstrate the effectiveness of our approach on three applications: 1) video pose propagation, 2) training a network on annotations augmented with propagated pose pseudo-labels, 3) temporal pose aggregation during inference. |
| Researcher Affiliation | Collaboration | Gedas Bertasius¹·², Christoph Feichtenhofer¹, Du Tran¹, Jianbo Shi², Lorenzo Torresani¹ (¹Facebook AI, ²University of Pennsylvania) |
| Pseudocode | No | The paper includes architectural diagrams (Figure 1 and Figure 2) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code has been made available at: https://github.com/facebookresearch/PoseWarper. |
| Open Datasets | Yes | Our trained PoseWarper can then be used for several applications. [...] and leads to state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets [22]. |
| Dataset Splits | Yes | We train our PoseWarper on sparsely labeled videos from the training set of PoseTrack2017 [22] and then perform our evaluations on the validation set. |
| Hardware Specification | Yes | The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and an HRNet-W48 backbone, but does not provide specific version numbers for any software dependencies, libraries, or frameworks used for implementation. |
| Experiment Setup | Yes | Implementation Details. Following the framework in [27], for training, we crop a 384×288 bounding box around each person and use it as input to our model. During training, we use ground truth person bounding boxes. We also employ random rotations, scaling, and horizontal flipping to augment the data. To learn the network, we use the Adam optimizer [54] with a base learning rate of 10⁻⁴, which is reduced to 10⁻⁵ and 10⁻⁶ after 10 and 15 epochs, respectively. The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs. We initialize our model with an HRNet-W48 [27] pretrained for a COCO keypoint estimation task. To train the deformable warping module, we select Frame B with a random time-gap δ ∈ [−3, 3] relative to Frame A. To compute features relating the two frames, we use twenty 3×3 residual blocks, each with 128 channels. To compute the offsets o(d), we use five 3×3 convolutional layers, each using a different dilation rate (d = 3, 6, 12, 18, 24). To resample the pose heatmap f_B, we employ five 3×3 deformable convolutional layers, each applied to one of the five predicted offset maps o(d). The five deformable convolution layers too employ different dilation rates of 3, 6, 12, 18, 24. (Hedged code sketches of this module and the training schedule follow the table.) |
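
The implementation details quoted above are concrete enough to sketch the deformable warping module. Below is a minimal PyTorch sketch under stated assumptions: the class and variable names (`DeformableWarper`, `ResBlock`), the 17-channel heatmap width (the COCO keypoint convention), the exact residual-block design, and the summation used to fuse the five warped heatmaps are assumptions for illustration, not taken from the authors' released code.

```python
# Hypothetical sketch of the deformable warping module described in the paper.
# Layer counts and dilation rates follow the quoted implementation details;
# everything else (names, fusion by summation, residual-block design) is assumed.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

DILATIONS = (3, 6, 12, 18, 24)  # dilation rates quoted in the paper

class ResBlock(nn.Module):
    """A plain 3x3 residual block at a fixed channel width (design assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class DeformableWarper(nn.Module):
    def __init__(self, in_channels, feat_channels=128, heatmap_channels=17):
        super().__init__()
        # Twenty 3x3 residual blocks, each with 128 channels, compute
        # features relating Frame A and Frame B.
        self.relate = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, padding=1),
            *[ResBlock(feat_channels) for _ in range(20)],
        )
        # Five 3x3 conv layers predict the offset maps o(d), one per dilation
        # rate; a 3x3 DeformConv2d expects 2*3*3 = 18 offset channels.
        self.offset_preds = nn.ModuleList(
            nn.Conv2d(feat_channels, 18, 3, padding=d, dilation=d)
            for d in DILATIONS
        )
        # Five deformable 3x3 conv layers resample the pose heatmap f_B,
        # each driven by one predicted offset map at a matching dilation.
        self.warps = nn.ModuleList(
            DeformConv2d(heatmap_channels, heatmap_channels, 3,
                         padding=d, dilation=d)
            for d in DILATIONS
        )

    def forward(self, feats_ab, heatmap_b):
        # feats_ab and heatmap_b are assumed to share spatial resolution.
        rel = self.relate(feats_ab)
        warped = [warp(heatmap_b, pred(rel))
                  for pred, warp in zip(self.offset_preds, self.warps)]
        # How the five warped heatmaps are fused is not in the quote above;
        # summation is an assumption.
        return torch.stack(warped, dim=0).sum(dim=0)

if __name__ == "__main__":
    # Smoke test: a 384x288 crop at stride 4 gives 96x72 feature maps.
    feats = torch.randn(1, 256, 96, 72)      # in_channels=256 is an assumption
    heatmap_b = torch.randn(1, 17, 96, 72)
    out = DeformableWarper(in_channels=256)(feats, heatmap_b)
    print(out.shape)  # torch.Size([1, 17, 96, 72])
```

Predicting one offset field per dilation rate lets the module cover both small and large inter-frame motion; the 18 offset channels per head are dictated by `torchvision.ops.DeformConv2d`, which needs 2 offsets per position of a 3×3 kernel.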
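
The optimization schedule in the same quote maps directly onto standard PyTorch components. A minimal sketch, assuming the hypothetical `DeformableWarper` above stands in for the full pose network:

```python
# Hedged sketch of the quoted schedule: Adam at 1e-4, decayed tenfold after
# epochs 10 and 15, and training terminated after 20 epochs.
import torch

model = DeformableWarper(in_channels=256)  # in_channels is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 15], gamma=0.1)  # 1e-4 -> 1e-5 -> 1e-6

for epoch in range(20):  # training is terminated after 20 epochs
    # ... one epoch over sparsely labeled PoseTrack training pairs ...
    scheduler.step()
```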