End-to-End Text Recognition with Hybrid HMM Maxout Models

Authors: Ouais Alsharif; Joelle Pineau

ICLR 2014

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: "Using these elements, we build a tunable and highly accurate recognition system that beats state-of-the-art results on all the sub-problems for both the ICDAR 2003 and SVT benchmark datasets."

Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: "Ouais Alsharif (OUAIS.ALSHARIF@MAIL.MCGILL.CA), Reasoning and Learning Laboratory, School of Computer Science, McGill University, Montreal, QC, Canada; Joelle Pineau (JPINEAU@CS.MCGILL.CA), Reasoning and Learning Laboratory, School of Computer Science, McGill University, Montreal, QC, Canada"

Reproducibility Variable: Pseudocode
Result: Yes
LLM Response: "Algorithm 1 Cascade Beam Search"

Reproducibility Variable: Open Source Code
Result: No
LLM Response: "Code for this paper will be provided with the final version."

Reproducibility Variable: Open Datasets
Result: Yes
LLM Response: "The dataset we use for this task is the ICDAR 2003 character recognition dataset (Lucas et al., 2003) which consists of 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in (Wang et al., 2012). We augment the training dataset with 75,495 character images from the Chars74k English dataset (de Campos et al., 2009) and 50,000 synthetic characters generated by (Wang et al., 2012), making the total size of the training set 131,609 tightly cropped character images."

Reproducibility Variable: Dataset Splits
Result: No
LLM Response: "The dataset we use for this task is the ICDAR 2003 character recognition dataset (Lucas et al., 2003) which consists of 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in (Wang et al., 2012)."

Reproducibility Variable: Hardware Specification
Result: No
LLM Response: "Training was done on GPUs using Theano (Bergstra et al., 2010) and pylearn (Goodfellow et al., 2013a)." (No specific GPU model or other hardware details provided.)

Reproducibility Variable: Software Dependencies
Result: No
LLM Response: "Training was done on GPUs using Theano (Bergstra et al., 2010) and pylearn (Goodfellow et al., 2013a)." (Specific versions of these software dependencies are not provided.)

Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: "The architecture we use for this task is a five-layer convolutional Maxout network with the first three layers being convolution-pooling Maxout layers, the fourth a Maxout layer and finally a softmax layer on top. The first three layers have respectively 48, 128, 128 filters of sizes 8-by-8 for the first two and 5-by-5 for the third, pooling over regions of sizes 4-by-4, 4-by-4 and 2-by-2 respectively, with 2 linear pieces per Maxout unit and a 2-by-2 stride. The 4th layer has 400 units and 5 linear pieces per Maxout unit, fully connected with the softmax output layer. We train the proposed network on 32-by-32 grey-scale character image patches with a simple preprocessing stage of subtracting the mean of every patch and dividing by its standard deviation + ε. Similar to (Goodfellow et al., 2013b), we train this network using stochastic gradient descent with momentum and dropout to maximize log p(y|x)."
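The Experiment Setup quote relies on two concrete mechanisms: a Maxout unit (each output takes the max over k affine "pieces") and per-patch normalization (subtract the patch mean, divide by standard deviation + ε). The NumPy sketch below illustrates both under assumed shapes — the random weights, seed, ε value, and batch size are illustrative choices, not taken from the paper; only the 32-by-32 patch size and the 400-unit, 5-piece fourth layer come from the quote.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not from the paper

def preprocess(patch, eps=1e-8):
    """Per-patch normalization from the setup quote: subtract the
    patch mean, divide by its standard deviation + eps."""
    return (patch - patch.mean()) / (patch.std() + eps)

def maxout(x, W, b):
    """Maxout activation: each output unit is the max over k affine
    pieces.  Shapes: x (n, d_in), W (k, d_in, d_out), b (k, d_out)."""
    z = np.einsum('nd,kdo->nko', x, W) + b  # all k pieces: (n, k, d_out)
    return z.max(axis=1)                    # elementwise max: (n, d_out)

# Illustrative forward pass: 32x32 grey-scale patches flattened to 1024
# dims, feeding a fully connected Maxout layer with 400 units and
# 5 pieces, matching the quoted 4th layer.  Weights are untrained.
x = np.stack([preprocess(p) for p in rng.normal(size=(4, 32 * 32))])
k, d_in, d_out = 5, 32 * 32, 400
W = rng.normal(scale=0.01, size=(k, d_in, d_out))
b = np.zeros((k, d_out))
h = maxout(x, W, b)
print(h.shape)  # (4, 400)
```

By construction, the Maxout output dominates every individual piece, which is what lets the unit approximate arbitrary convex activations as the number of pieces grows.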