Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition

Authors: Yanhua Cheng, Xin Zhao, Rui Cai, Zhiwei Li, Kaiqi Huang, Yong Rui

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on the benchmark RGB-D dataset demonstrate that, with only 5% labeled training data, our approach achieves competitive performance for object recognition compared with the state-of-the-art results reported by fully-supervised methods. |
| Researcher Affiliation | Collaboration | Yanhua Cheng1, Xin Zhao1, Rui Cai2, Zhiwei Li2, Kaiqi Huang1,3, Yong Rui2. 1CRIPAC & NLPR, CASIA; 2Microsoft Research; 3CAS Center for Excellence in Brain Science and Intelligence Technology. {yh.cheng, xzhao, kaiqi.huang}@nlpr.ia.ac.cn, {ruicai, zli, yongrui}@microsoft.com |
| Pseudocode | No | The paper describes the algorithms in prose and with diagrams (Fig. 2), but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides no statement about releasing source code and no link to a code repository for the described methodology. |
| Open Datasets | Yes | We perform our experiments on the Washington RGB-D dataset [Lai et al., 2011a] captured by Microsoft Kinect. |
| Dataset Splits | Yes | To evaluate our semi-supervised learning, we first utilize one of the 10 random splits provided by [Lai et al., 2011a] to divide the dataset into a training set and a testing set. For any split, there are around 35,000 examples for training and around 6,877 for testing. Then we randomly label 5% of the training samples (around 1,750) and leave the rest unlabeled (around 33,250). A sketch of this split procedure is given after the table. |
| Hardware Specification | No | The paper gives no specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions optimization algorithms (SGD) and architectures (AlexNet) but does not list software dependencies or library version numbers used in the implementation. |
| Experiment Setup | Yes | We fix α = 0.5, K = 20, β = 1 for our semi-supervised learning method, although dynamically fine-tuning each parameter could result in better performance. For the reconstruction network of each modality, we use a mini-batch b = 128 of images and an initial learning rate of 10⁻⁵, multiplying the learning rate by 0.1 every s = 4000 iterations. For training the RGB- and depth-DCNN models for recognition during every iteration, we set b = 128, a learning rate of 10⁻⁷, and s = 3000. A sketch of the implied step-decay schedule is given after the table. |
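
The 5% labeled split quoted in the Dataset Splits row is simple to reproduce. Below is a minimal Python sketch of the sampling step, assuming the training indices of one Washington RGB-D split are already loaded; the function name and index layout are hypothetical, and only the 5% fraction and the approximate counts (~1,750 labeled of ~35,000 training examples) come from the quoted text.

```python
# Minimal sketch of the 5% labeled / 95% unlabeled split described in the
# paper. The loader and index layout are hypothetical; only the proportions
# come from the quoted Dataset Splits text.
import random

def make_semisupervised_split(train_indices, labeled_fraction=0.05, seed=0):
    """Randomly mark `labeled_fraction` of the training set as labeled."""
    rng = random.Random(seed)
    shuffled = train_indices[:]
    rng.shuffle(shuffled)
    n_labeled = int(len(shuffled) * labeled_fraction)
    labeled = sorted(shuffled[:n_labeled])    # ~1,750 examples at 5%
    unlabeled = sorted(shuffled[n_labeled:])  # ~33,250 examples
    return labeled, unlabeled

# Example with the approximate training-set size reported in the paper:
train_indices = list(range(35000))
labeled, unlabeled = make_semisupervised_split(train_indices)
print(len(labeled), len(unlabeled))  # 1750 33250
```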
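
The Experiment Setup row implies a standard step-decay SGD schedule (multiply the learning rate by 0.1 every s iterations). The Python sketch below shows how the two quoted schedules evolve; the `learning_rate` helper is hypothetical, and only the batch size b = 128, the initial rates 10⁻⁵ and 10⁻⁷, the 0.1 decay factor, and the step sizes s = 4000 and s = 3000 come from the paper.

```python
# Sketch of the step-decay schedule implied by the quoted setup; the helper
# is hypothetical, the constants come from the Experiment Setup row.
def learning_rate(initial_lr, iteration, step_size, gamma=0.1):
    """Multiply the learning rate by `gamma` every `step_size` iterations."""
    return initial_lr * (gamma ** (iteration // step_size))

# Reconstruction network of each modality: b = 128, lr0 = 1e-5, s = 4000.
for it in (0, 4000, 8000):
    print(it, learning_rate(1e-5, it, step_size=4000))  # 1e-5, 1e-6, 1e-7

# RGB- and depth-DCNN recognition training: b = 128, lr0 = 1e-7, s = 3000.
for it in (0, 3000, 6000):
    print(it, learning_rate(1e-7, it, step_size=3000))  # 1e-7, 1e-8, 1e-9
```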