VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem

Authors: Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, Niki Trigoni

AAAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we present an on-manifold sequence-to-sequence learning approach to motion estimation using visual and inertial sensors. It is to the best of our knowledge the first end-to-end trainable method for visual-inertial odometry which performs fusion of the data at an intermediate feature-representation level. Our method has numerous advantages over traditional approaches. Specifically, it eliminates the need for tedious manual synchronization of the camera and IMU as well as eliminating the need for manual calibration between the IMU and camera. A further advantage is that our model naturally and elegantly incorporates domain specific information which significantly mitigates drift. We show that our approach is competitive with state-of-the-art traditional methods when accurate calibration data is available and can be trained to outperform them in the presence of calibration and synchronization errors.
Researcher Affiliation | Academia | Ronald Clark, Sen Wang, Hongkai Wen, Andrew Markham, Niki Trigoni. Department of Computer Science, University of Oxford, United Kingdom. Email: {firstname.lastname}@cs.ox.ac.uk
Pseudocode | Yes | Algorithm 1 Joint training of se(3) and SE(3) loss (a minimal code sketch of this update is given after the table):
while i < n_iter do
    w_{1:n} ← w_{1:n} − λ1 ∂L_SE(3)(w_l, x_t)/∂w_l
    w_{1:j} ← w_{1:j} − λ2 ∂L_se(3)(w_l, x_t)/∂w_l
end while
Open Source Code | No | The paper does not provide a direct link or explicit statement about the release of source code.
Open Datasets | Yes | UAV: Challenging Indoor Trajectory. We first evaluate our approach on the publicly-available indoor EuRoC micro-aerial-vehicle (MAV) dataset (Burri et al. 2016). The data for this dataset was captured using an AscTec Firefly MAV with a front-facing visual-inertial sensor unit with tight synchronization between the camera and IMU timestamps.
Dataset Splits | Yes | The training performance in Fig. 6 shows the difference between training solely on the F-2-F displacements, solely on the full SE(3) pose and using our joint training method. The results show that joint training allows the network to converge more quickly towards low-error estimates over the training and validation sequences, while the F-2-F training converges very slowly and training on the full pose converges to a high-error estimate.
Hardware Specification | Yes | A forward pass of images through the CNN part of the network takes on average 160 ms (≈10 Hz) on a single Tesla K80. The LSTM updates are much less computationally expensive and can run at >200 Hz on the Tesla K80.
Software Dependencies | Yes | For our experiments, we implemented our model using the Theano library (Bergstra et al. 2010).
Experiment Setup | Yes | For our network we use LSTMs with 2 layers with cells of 1000 units. Our CNN has a total of 55,897 trainable weights. A forward pass of images through the CNN part of the network takes on average 160 ms (≈10 Hz) on a single Tesla K80. The LSTM updates are much less computationally expensive and can run at >200 Hz on the Tesla K80.

The entire network is trained using Backpropagation Through Time (BPTT). We use standard BPTT, which works by unfolding the network for a selected number of timesteps, T, and then applying the standard backpropagation learning method involving two passes: a forward pass and a backward pass. In the forward pass of BPTT, the activations of the network from Equations 1 to 6 are calculated successively for each timestep from time t = 1 to T. Using the resulting activations, the backward pass proceeds from time t = T to t = 1, calculating the derivatives of each output unit with respect to the layer input (x_l) and the weights of the layer (w_l). The final derivatives are then determined by summing over the timesteps. Stochastic Gradient Descent (SGD) with an RMSProp adaptive learning rate is used to update the weights of the network from the derivatives determined by BPTT. SGD is a simple and popular method that performs very well for training a variety of machine learning models using large datasets (Bottou and Bousquet 2008). Using SGD, the weights of the network are updated as follows: w_l ← w_l − λ ∂L(w_l, x_t)/∂w_l, where w_l represents a parameter (weight or bias) of the network indexed by l, and the learning rate λ determines how strongly the derivatives influence the weight updates during each iteration of SGD. For all our training we select the best learning rate.

To reduce the memory required, but still keep continuity during training, we use a training structure in which training is carried out over a sliding window of batches, with the hidden state of the LSTM carried over between windows, as illustrated in Fig. 4.

Finally, we found that training the network directly through the SE(3) accumulation is particularly difficult as the training procedure suffers from many local minima. In order to overcome this difficulty, we consider two losses, one based on the se(3) frame-to-frame (F-2-F) predictions and the other on the SE(3) full concatenated pose relative to the start of the sequence. The loss computed from the F-2-F pose is L_se(3) = α‖ω − ω̂‖ + β‖v − v̂‖ (13). For the full concatenated pose in SE(3), we use a quaternionic representation for the orientation, giving the loss L_SE(3) = α‖q − q̂‖ + β‖T − T̂‖ (14). We consider three types of training: training only on the L_se(3) loss, only on the L_SE(3) loss, and joint training of both losses. The weight updates for the joint training are shown in Algorithm 1. During training we start with a high relative learning rate for the se(3) loss (λ2/λ1 ≈ 100) and then reduce this ratio to a very low value during the later epochs to fine-tune the concatenated pose estimation. We trained the model for each dataset for 200 epochs, which took on average 6 hours per dataset.
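To make the quoted setup concrete, here is a minimal sketch of the two losses in Equations 13 and 14. It assumes the predicted and ground-truth quantities are already available as NumPy vectors and treats α and β simply as scalar weights; the function names are introduced here for illustration only.

```python
import numpy as np

def se3_f2f_loss(omega_pred, omega_gt, v_pred, v_gt, alpha, beta):
    """Frame-to-frame se(3) loss (Eq. 13): weighted rotational + translational error."""
    return alpha * np.linalg.norm(omega_gt - omega_pred) + beta * np.linalg.norm(v_gt - v_pred)

def SE3_pose_loss(q_pred, q_gt, T_pred, T_gt, alpha, beta):
    """Full concatenated-pose SE(3) loss (Eq. 14): quaternion + position error."""
    return alpha * np.linalg.norm(q_gt - q_pred) + beta * np.linalg.norm(T_gt - T_pred)
```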
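The joint update from Algorithm 1 (referenced in the Pseudocode row above) can be sketched in the same spirit. The gradient helpers and the layer index j, up to which the frame-to-frame loss is back-propagated, are hypothetical names not taken from the paper; this is an illustrative sketch of the update rule, not the authors' implementation.

```python
def joint_training_step(weights, x_t, lam1, lam2, grad_SE3_loss, grad_se3_loss, j):
    """One iteration of the joint se(3)/SE(3) update sketched in Algorithm 1.

    weights       : list of per-layer parameter arrays w_1 .. w_n
    x_t           : current training window (image pairs + IMU sub-sequence)
    lam1, lam2    : learning rates for the SE(3) and se(3) losses
    grad_SE3_loss : hypothetical callable returning dL_SE(3)/dw_l for layer weights w_l
    grad_se3_loss : hypothetical callable returning dL_se(3)/dw_l for layer weights w_l
    j             : index up to which the frame-to-frame se(3) loss is applied
    """
    # The SE(3) (full concatenated pose) loss updates all layers w_1 .. w_n.
    for l in range(len(weights)):
        weights[l] = weights[l] - lam1 * grad_SE3_loss(weights[l], x_t)
    # The se(3) (frame-to-frame) loss only updates the lower layers w_1 .. w_j.
    for l in range(j):
        weights[l] = weights[l] - lam2 * grad_se3_loss(weights[l], x_t)
    return weights
```

The λ2/λ1 schedule described in the setup (starting high, then reduced in later epochs) would simply change lam1 and lam2 between epochs.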
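Finally, the sliding-window training structure with the LSTM state carried across windows could look roughly like the sketch below. `forward_window` and `backprop_window` are hypothetical stand-ins for the unrolled forward pass and the truncated-BPTT weight update, since the paper does not publish code; the key point is only that the LSTM state is carried over between windows rather than reset.

```python
def train_sequence(batches, window_size, init_state, forward_window, backprop_window):
    """Truncated BPTT over a sliding window of batches, carrying the LSTM state.

    batches         : list of (inputs, targets) pairs in temporal order
    window_size     : number of timesteps T unrolled per window
    init_state      : initial LSTM (hidden, cell) state
    forward_window  : hypothetical fn (inputs, state) -> (outputs, final_state)
    backprop_window : hypothetical fn (outputs, targets) performing the weight update
    """
    state = init_state
    for start in range(0, len(batches), window_size):
        window = batches[start:start + window_size]
        inputs = [b[0] for b in window]
        targets = [b[1] for b in window]
        # Unroll over this window, starting from the state carried over from the
        # previous window (rather than re-initialising it), to keep continuity.
        outputs, state = forward_window(inputs, state)
        # Gradients are truncated at the window boundary.
        backprop_window(outputs, targets)
    return state
```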