End-to-end Active Object Tracking via Reinforcement Learning

Authors: Wenhan Luo, Peng Sun, Fangwei Zhong, Wei Liu, Tong Zhang, Yizhou Wang

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The tracker trained in simulators (ViZDoom, Unreal Engine) shows good generalization in the case of unseen object moving path, unseen object appearance, unseen background, and distracting object. It can restore tracking when occasionally losing the target. With the experiments over the VOT dataset, we also find that the tracking ability, obtained solely from simulators, can potentially transfer to real-world scenarios.
Researcher Affiliation | Collaboration | Tencent AI Lab and Peking University.
Pseudocode | No | The paper describes algorithms and network architecture in text and diagrams but does not include explicit pseudocode blocks or sections labeled 'Algorithm'.
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Finally, we perform qualitative evaluation on some video clips taken from the VOT dataset (Kristan et al., 2016).
Dataset Splits | No | The paper mentions 'best validation result' but does not specify how validation sets were created, their sizes, or other details needed to reproduce the split.
Hardware Specification | No | The paper does not explicitly describe the specific hardware components (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions software like ViZDoom, Unreal Engine, UnrealCV, OpenAI Gym, OpenCV, and Dlib, but it does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | To be more specific, the tracker observes the raw visual state and takes one action from the action set A = {turn-left, turn-right, turn-left-and-move-forward, turn-right-and-move-forward, move-forward, no-op}... The screen is resized to an 84 × 84 × 3 RGB image as the network input... the reward is r = A − (√(x² + (y − d)²)/c + λ|ω|), where A > 0, c > 0, d > 0, λ > 0 are tuning parameters... we let the reward threshold be -450 and the maximum length be 3000, respectively... This map is then augmented as described in Sec. 3.5 with N = 21.
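
As a concrete reading of that setup, the Python sketch below enumerates the six-action set, resizes a raw frame to the 84 × 84 × 3 network input, and evaluates a reward of the quoted form. It is a minimal sketch, not the authors' implementation: the `Action` member names, the default values of A, c, d, and λ, and the interpretation of the -450 threshold as an early-termination condition on cumulative episode reward are assumptions for illustration.

```python
"""Sketch of the quoted experiment setup (action set, input size, reward shape).

Assumptions, not taken from the paper: Action member names, the default
parameter values of the reward, and reading the -450 threshold as an
early-termination condition on the cumulative episode reward.
"""
from enum import Enum
import math

import cv2          # the paper mentions OpenCV; used here only to resize frames
import numpy as np


class Action(Enum):
    """The six-element action set A quoted in the experiment setup."""
    TURN_LEFT = 0
    TURN_RIGHT = 1
    TURN_LEFT_AND_MOVE_FORWARD = 2
    TURN_RIGHT_AND_MOVE_FORWARD = 3
    MOVE_FORWARD = 4
    NO_OP = 5


def preprocess(frame: np.ndarray) -> np.ndarray:
    """Resize a raw RGB frame to the 84 x 84 x 3 network input."""
    return cv2.resize(frame, (84, 84), interpolation=cv2.INTER_AREA)


def reward(x: float, y: float, omega: float,
           A: float = 1.0, c: float = 1.0, d: float = 1.0, lam: float = 1.0) -> float:
    """r = A - (sqrt(x^2 + (y - d)^2) / c + lam * |omega|).

    (x, y) is the target position and omega its orientation in the tracker's
    local frame; A, c, d, lam > 0 are tuning parameters (the defaults here are
    placeholders, not the paper's values).
    """
    return A - (math.sqrt(x ** 2 + (y - d) ** 2) / c + lam * abs(omega))


# Quoted stopping criteria: reward threshold -450, maximum episode length 3000.
MAX_EPISODE_LENGTH = 3000
REWARD_THRESHOLD = -450.0


def episode_done(step: int, cumulative_reward: float) -> bool:
    """End the episode at the length cap or once the return drops below the threshold."""
    return step >= MAX_EPISODE_LENGTH or cumulative_reward <= REWARD_THRESHOLD


if __name__ == "__main__":
    dummy_frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a simulator frame
    obs = preprocess(dummy_frame)
    print(obs.shape, Action.MOVE_FORWARD, reward(x=0.0, y=1.0, omega=0.0))
```

Under this reward shape, the return peaks at A when the target sits at the expected distance d directly in front of the tracker with zero relative orientation, which is what makes the shaping suitable for active tracking.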