SEE: Towards Semi-Supervised End-to-End Scene Text Recognition

Authors: Christian Bartz, Haojin Yang, Christoph Meinel

AAAI 2018

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
"We introduce the idea behind our novel approach and show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve competitive results."

Researcher Affiliation | Academia
Christian Bartz, Haojin Yang, Christoph Meinel; Hasso Plattner Institute, University of Potsdam; Prof.-Dr.-Helmert-Straße 2-3, 14482 Potsdam, Germany; {christian.bartz, haojin.yang, meinel}@hpi.de

Pseudocode | No
The paper describes the system architecture and its components verbally and with diagrams, but it does not include a formal pseudocode block or an algorithm section.

Open Source Code | Yes
"Our contributions are as follows: ... (5) We provide our code1 and trained models2 to the research community." 1https://github.com/Bartzi/see 2https://bartzi.de/research/see

Open Datasets | Yes
"First, we performed experiments on the SVHN dataset (Netzer et al. 2011)... The third dataset we experimented with was the French Street Name Signs (FSNS) dataset (Smith et al. 2016)."

Dataset Splits | No
"During our experiments we found that, when trained from scratch, a network that shall detect and recognize more than two text lines does not converge. In order to overcome this problem we designed a curriculum learning strategy (Bengio et al. 2009) for training the system. The complexity of the supplied training images under this curriculum is gradually increasing, once the accuracy on the validation set has settled." The paper mentions using a validation set but does not provide specific details on how this split was created (e.g., percentages, sample counts, or a reference to a predefined split).

Hardware Specification | Yes
"We conducted all our experiments on a work station which has an Intel(R) Core(TM) i7-6900K CPU, 64 GB RAM and 4 TITAN X (Pascal) GPUs."

Software Dependencies | No
"We implemented all our experiments using Chainer (Tokui et al. 2015)." While Chainer is mentioned, a specific version number is not provided, which is required for reproducibility.

Experiment Setup | Yes
Localization Network: "The localization network used in every experiment is based on the ResNet architecture (He et al. 2016a). The number of convolutional filters is 32, 48 and 48, respectively. A 2×2 max-pooling with stride 2 follows after the second residual block. The last residual block is followed by a 5×5 average pooling layer, and this layer is followed by an LSTM with 256 hidden units." Recognition Network: "In our SVHN experiments, the recognition network has the same structure as the localization network, but the number of convolutional filters is higher. The number of convolutional filters is 32, 64 and 128, respectively." Model Training: "In order to overcome this problem we designed a curriculum learning strategy (Bengio et al. 2009) for training the system."
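The quoted experiment-setup details can be summarized as configuration data. The sketch below is ours, not the authors' code: it records the filter counts, pooling, and LSTM size named in the paper excerpt, plus a helper for the standard no-padding pooling output-size formula. The average-pool stride and all identifier names are assumptions for illustration.

```python
# Hedged sketch of the layer configuration described in the paper excerpt.
# Only the filter counts, pooling sizes, and LSTM width come from the paper;
# the dict layout, names, and the average-pool stride are our assumptions.

def pooled_size(size, kernel, stride):
    """Spatial output size of a pooling layer with no padding."""
    return (size - kernel) // stride + 1

LOCALIZATION_NET = {
    "conv_filters": [32, 48, 48],            # three residual blocks
    "max_pool": {"kernel": 2, "stride": 2},  # after the second block
    "avg_pool": {"kernel": 5, "stride": 5},  # 5x5; stride is an assumption
    "lstm_hidden": 256,
}

# "Same structure ... but the number of convolutional filters is higher."
RECOGNITION_NET = {**LOCALIZATION_NET, "conv_filters": [32, 64, 128]}

# Example: a 64x64 feature map halves to 32x32 after the 2x2/stride-2 max-pool.
```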
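The curriculum-learning strategy quoted above (advance to more complex training images once validation accuracy has settled) can be sketched as a plateau-triggered stage counter. This is a hypothetical illustration of the idea, not the authors' implementation; the class name, `patience`, and `min_delta` parameters are our assumptions.

```python
class CurriculumScheduler:
    """Sketch of a curriculum: advance to a harder data stage once
    validation accuracy has settled (no improvement for `patience` checks)."""

    def __init__(self, num_stages, patience=3, min_delta=1e-3):
        self.num_stages = num_stages
        self.patience = patience      # assumed plateau window
        self.min_delta = min_delta    # assumed improvement threshold
        self.stage = 0                # 0 = simplest training images
        self.best = float("-inf")
        self.stalled = 0

    def update(self, val_accuracy):
        """Call after each validation run; returns the current stage."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.stalled = 0
        else:
            self.stalled += 1
        if self.stalled >= self.patience and self.stage < self.num_stages - 1:
            self.stage += 1           # switch to more complex images
            self.best = float("-inf")
            self.stalled = 0
        return self.stage
```

Usage: feed the validation accuracy in after every epoch and use the returned stage to select which subset of training images (e.g. how many text lines per image) to sample from.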