End-to-End Text Recognition with Hybrid HMM Maxout Models
Authors: Ouais Alsharif; Joelle Pineau
ICLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using these elements, we build a tunable and highly accurate recognition system that beats state-of-the-art results on all the sub-problems for both the ICDAR 2003 and SVT benchmark datasets. |
| Researcher Affiliation | Academia | Ouais Alsharif OUAIS.ALSHARIF@MAIL.MCGILL.CA Reasoning and Learning Laboratory, School of Computer Science, McGill University, Montreal, QC, Canada Joelle Pineau JPINEAU@CS.MCGILL.CA Reasoning and Learning Laboratory, School of Computer Science, McGill University, Montreal, QC, Canada |
| Pseudocode | Yes | Algorithm 1 Cascade Beam Search |
| Open Source Code | No | Code for this paper will be provided with the final version. |
| Open Datasets | Yes | The dataset we use for this task is the ICDAR 2003 character recognition dataset (Lucas et al., 2003) which consists of 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in (Wang et al., 2012). We augment the training dataset with 75,495 character images from the Chars74k English dataset (de Campos et al., 2009) and 50,000 synthetic characters generated by (Wang et al., 2012) making the total size of the training set 131,609 tightly cropped character images. |
| Dataset Splits | No | The dataset we use for this task is the ICDAR 2003 character recognition dataset (Lucas et al., 2003) which consists of 6114 training samples and 5379 test samples after removing all non-alphanumeric characters as in (Wang et al., 2012). |
| Hardware Specification | No | Training was done on GPUs using Theano (Bergstra et al., 2010) and pylearn2 (Goodfellow et al., 2013a). (No specific GPU model or other hardware details are provided.) |
| Software Dependencies | No | Training was done on GPUs using Theano (Bergstra et al., 2010) and pylearn2 (Goodfellow et al., 2013a). (Specific versions of these software dependencies are not provided.) |
| Experiment Setup | Yes | The architecture we use for this task is a five-layer convolutional Maxout network with the first three layers being convolution-pooling Maxout layers, the fourth a Maxout layer and finally a softmax layer on top. The first three layers have respectively 48, 128, 128 filters of sizes 8-by-8 for the first two and 5-by-5 for the third, pooling over regions of sizes 4-by-4, 4-by-4 and 2-by-2 respectively, with 2 linear pieces per Maxout unit and a 2-by-2 stride. The 4th layer has 400 units and 5 linear pieces per Maxout unit, fully connected with the softmax output layer. We train the proposed network on 32-by-32 grey-scale character image patches with a simple preprocessing stage of subtracting the mean of every patch and dividing by its standard deviation + ϵ. Similar to (Goodfellow et al., 2013b), we train this network using stochastic gradient descent with momentum and dropout to maximize log p(y|x). |
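
The Experiment Setup row quotes enough detail to reconstruct the character classifier's topology. Below is a minimal sketch of that five-layer convolutional Maxout network in PyTorch; the original work used Theano and pylearn2, so this is not the authors' code. The per-patch normalisation, filter counts (48/128/128), kernel sizes (8, 8, 5), pooling regions (4, 4, 2 with stride 2), and the 400-unit Maxout layer with 5 pieces follow the excerpt, while the padding, the ε value, the dropout placement, and the 62-class output size are assumptions made only to keep the example runnable.

```python
# Minimal sketch, not the authors' implementation: a PyTorch rendering of the
# five-layer convolutional Maxout network quoted in the Experiment Setup row.
# Padding, epsilon, dropout placement and the 62-class output are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def maxout(x: torch.Tensor, pieces: int) -> torch.Tensor:
    """Take the max over `pieces` linear feature maps stacked along dim 1."""
    n, c = x.shape[0], x.shape[1]
    return x.view(n, c // pieces, pieces, *x.shape[2:]).max(dim=2).values


class ConvMaxoutCharNet(nn.Module):
    def __init__(self, num_classes: int = 62):  # 10 digits + 52 letters (assumed)
        super().__init__()
        # Each conv emits filters * pieces maps; maxout keeps 48 / 128 / 128.
        self.conv1 = nn.Conv2d(1, 48 * 2, kernel_size=8, padding=4)
        self.conv2 = nn.Conv2d(48, 128 * 2, kernel_size=8, padding=4)
        self.conv3 = nn.Conv2d(128, 128 * 2, kernel_size=5, padding=2)
        # Fourth layer: 400 Maxout units with 5 linear pieces each.
        self.fc4 = nn.LazyLinear(400 * 5)        # input size inferred at first call
        self.out = nn.Linear(400, num_classes)   # softmax is applied by the loss

    def forward(self, x):
        # Per-patch preprocessing: subtract the mean, divide by (std + eps).
        mean = x.mean(dim=(1, 2, 3), keepdim=True)
        std = x.std(dim=(1, 2, 3), keepdim=True)
        x = (x - mean) / (std + 1e-4)            # epsilon value is an assumption

        # Three conv-pool Maxout stages: pool regions 4x4, 4x4, 2x2, stride 2.
        x = F.max_pool2d(maxout(self.conv1(x), 2), kernel_size=4, stride=2)
        x = F.max_pool2d(maxout(self.conv2(x), 2), kernel_size=4, stride=2)
        x = F.max_pool2d(maxout(self.conv3(x), 2), kernel_size=2, stride=2)

        x = torch.flatten(x, 1)
        x = F.dropout(x, p=0.5, training=self.training)  # placement assumed
        x = maxout(self.fc4(x), pieces=5)        # 400-unit fully connected Maxout
        return self.out(x)


if __name__ == "__main__":
    net = ConvMaxoutCharNet()
    patches = torch.randn(8, 1, 32, 32)          # 32-by-32 grey-scale patches
    logits = net(patches)                        # shape (8, 62)
    # "SGD with momentum and dropout to maximize log p(y|x)":
    # minimizing cross-entropy on these logits is the equivalent objective.
    loss = F.cross_entropy(logits, torch.randint(0, 62, (8,)))
    loss.backward()
    print(logits.shape, float(loss))
```

This sketch covers only the character classifier described in the excerpt; the paper's full pipeline additionally combines such classifiers with a hybrid HMM and the cascade beam search listed under Pseudocode, which the quoted material does not specify in enough detail to reconstruct here.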