Learning Temporal Pose Estimation from Sparsely-Labeled Videos

Authors: Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present our results on the PoseTrack [22] dataset. We demonstrate the effectiveness of our approach on three applications: 1) video pose propagation, 2) training a network on annotations augmented with propagated pose pseudo-labels, 3) temporal pose aggregation during inference.
Researcher Affiliation | Collaboration | Gedas Bertasius (1,2), Christoph Feichtenhofer (1), Du Tran (1), Jianbo Shi (2), Lorenzo Torresani (1). 1: Facebook AI, 2: University of Pennsylvania.
Pseudocode | No | The paper includes architectural diagrams (Figure 1 and Figure 2) but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code has been made available at: https://github.com/facebookresearch/PoseWarper
Open Datasets | Yes | Our trained PoseWarper can then be used for several applications. [...] and leads to state-of-the-art pose detection results on the PoseTrack2017 and PoseTrack2018 datasets [22].
Dataset Splits | Yes | We train our PoseWarper on sparsely labeled videos from the training set of PoseTrack2017 [22] and then perform our evaluations on the validation set.
Hardware Specification | Yes | The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs.
Software Dependencies | No | The paper mentions using the Adam optimizer and an HRNet-W48 backbone, but does not provide version numbers for any software dependencies, libraries, or frameworks used in the implementation.
Experiment Setup | Yes | Implementation Details. Following the framework in [27], for training, we crop a 384×288 bounding box around each person and use it as input to our model. During training, we use ground-truth person bounding boxes. We also employ random rotations, scaling, and horizontal flipping to augment the data. To learn the network, we use the Adam optimizer [54] with a base learning rate of 10^-4, which is reduced to 10^-5 and 10^-6 after 10 and 15 epochs, respectively. The training is performed using 4 Tesla M40 GPUs, and is terminated after 20 epochs. We initialize our model with an HRNet-W48 [27] pretrained for the COCO keypoint estimation task. To train the deformable warping module, we select Frame B with a random time-gap δ ∈ [-3, 3] relative to Frame A. To compute features relating the two frames, we use twenty 3×3 residual blocks, each with 128 channels. To compute the offsets o^(d), we use five 3×3 convolutional layers, each using a different dilation rate (d = 3, 6, 12, 18, 24). To resample the pose heatmap f_B, we employ five 3×3 deformable convolutional layers, each applied to one of the five predicted offset maps o^(d). The five deformable convolution layers likewise use dilation rates of 3, 6, 12, 18, 24. (Illustrative code sketches of this setup follow below.)
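
The optimizer and learning-rate schedule quoted in the Experiment Setup row map directly onto standard PyTorch utilities. Below is a minimal sketch assuming a PyTorch implementation (the released code may structure this differently); `model` and `train_one_epoch` are hypothetical placeholders.

```python
# Minimal sketch of the quoted schedule: Adam at 1e-4, reduced 10x
# after epochs 10 and 15, training terminated after 20 epochs.
# `model` and `train_one_epoch` are hypothetical placeholders.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10, 15], gamma=0.1
)

for epoch in range(20):
    train_one_epoch(model, optimizer)  # one pass over the sparsely labeled training set
    scheduler.step()                   # lr: 1e-4 -> 1e-5 (epoch 10) -> 1e-6 (epoch 15)
```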
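
The multi-dilation offset prediction and deformable resampling described in the same row can likewise be sketched with torchvision's DeformConv2d. The module below is an assumption-laden illustration, not the authors' released code: the class and argument names are invented, the 128-channel stack of twenty residual blocks is omitted, the number of joints is a COCO-style guess, and summing the five rewarped heatmaps is an assumed aggregation step.

```python
# Illustrative sketch of the deformable warping module described in the
# quoted setup. Names are invented; torchvision's DeformConv2d stands in
# for the deformable convolution used by the authors.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

DILATIONS = (3, 6, 12, 18, 24)  # dilation rates quoted in the paper


class WarpingModuleSketch(nn.Module):
    def __init__(self, feat_channels=128, num_joints=17):
        # num_joints=17 is an assumption; the true value depends on the
        # dataset's keypoint definition.
        super().__init__()
        # Five 3x3 offset-prediction convs, one per dilation rate. Each
        # outputs 2 * 3 * 3 = 18 channels, the per-location (dy, dx)
        # offsets DeformConv2d expects for a 3x3 kernel.
        self.offset_convs = nn.ModuleList(
            [nn.Conv2d(feat_channels, 18, kernel_size=3, padding=d, dilation=d)
             for d in DILATIONS]
        )
        # Five 3x3 deformable convs over the heatmap f_B, with matching
        # dilation rates; padding=d preserves the spatial size.
        self.deform_convs = nn.ModuleList(
            [DeformConv2d(num_joints, num_joints, kernel_size=3,
                          padding=d, dilation=d)
             for d in DILATIONS]
        )

    def forward(self, relating_feats, heatmap_b):
        # relating_feats: 128-channel features relating Frames A and B
        # (the paper computes them with twenty 3x3 residual blocks,
        # omitted here). heatmap_b: pose heatmap f_B of Frame B.
        warped = [dcn(heatmap_b, off(relating_feats))
                  for off, dcn in zip(self.offset_convs, self.deform_convs)]
        # How the five rewarped heatmaps are fused is not specified in
        # the quoted text; summation here is an assumption.
        return torch.stack(warped).sum(dim=0)
```

A forward pass takes the features relating the two frames plus Frame B's heatmap and returns a heatmap rewarped toward Frame A, which is how the paper describes propagating pose information across sparsely labeled frames.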