Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks

Authors: Minhyung Cho, Chandra Dhir, Jaehyung Lee

NeurIPS 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results for offline handwriting and phoneme recognition show that an MDRNN with HF optimization performs better as the depth of the network increases, up to 15 layers.
Researcher Affiliation | Collaboration | Minhyung Cho, Chandra Shekhar Dhir, Jaehyung Lee; Applied Research Korea, Gracenote Inc.; {mhyung.cho,shekhardhir}@gmail.com, jaehyung.lee@kaist.ac.kr
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | The IFN/ENIT database [20] is a database of handwritten Arabic words consisting of 32,492 images. The TIMIT corpus [21] is a benchmark database for evaluating speech recognition performance.
Dataset Splits | Yes | The 25,955 images corresponding to subsets b-e were used for training. The validation set consisted of 3,269 images, the first half of set a sorted in alphabetical order (ae07_001.tif to ai54_028.tif). For the TIMIT corpus, the standard training, validation, and core test sets were used, containing 3,696, 400, and 192 sentences, respectively. (A split-construction sketch follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not name the ancillary software (e.g., libraries or solvers, with version numbers) needed to replicate the experiment.
Experiment Setup | Yes | For HF optimization, we followed the basic setup described in [8], but different parameters were utilized. Tikhonov damping was used together with Levenberg-Marquardt heuristics. The damping parameter λ was initialized to 0.1 and adjusted according to the reduction ratio ρ (multiplied by 0.9 if ρ > 0.75, divided by 0.9 if ρ < 0.25, and unchanged otherwise). For SGD optimization, the learning rate ϵ was chosen from {10^-4, 10^-5, 10^-6} and the momentum µ from {0.9, 0.95, 0.99}. Gaussian weight noise of standard deviation σ ∈ {0.03, 0.04, 0.05} was applied together with L2 regularization of strength 0.001. (A sketch of the damping schedule and hyperparameter grid follows the table.)
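
The Levenberg-Marquardt damping schedule quoted in the Experiment Setup row can be written in a few lines. The sketch below is only an illustration of that rule under the stated parameters, not the authors' implementation; the names adjust_damping and sgd_grid are assumptions.

```python
# Minimal sketch of the Tikhonov-damping update with the Levenberg-Marquardt
# heuristic described above. Function and variable names are illustrative
# assumptions, not taken from the paper's code.

def adjust_damping(lmbda, rho):
    """Update the damping parameter lambda from the reduction ratio rho.

    rho compares the actual decrease of the objective with the decrease
    predicted by the local quadratic model solved in each HF/CG step.
    """
    if rho > 0.75:      # model fits well -> relax damping
        return lmbda * 0.9
    if rho < 0.25:      # model fits poorly -> strengthen damping
        return lmbda / 0.9
    return lmbda        # otherwise keep it unchanged

lmbda = 0.1             # initial value reported in the paper
for rho in (0.8, 0.5, 0.1):
    lmbda = adjust_damping(lmbda, rho)   # 0.09 -> 0.09 -> 0.1

# Hyperparameter grid reported for the SGD baseline.
sgd_grid = {
    "learning_rate": [1e-4, 1e-5, 1e-6],
    "momentum": [0.9, 0.95, 0.99],
    "weight_noise_std": [0.03, 0.04, 0.05],  # Gaussian weight noise sigma
    "l2_strength": 0.001,
}
```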
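
Similarly, the IFN/ENIT split quoted in the Dataset Splits row (sets b-e for training, first half of alphabetically sorted set a for validation) could be constructed along the lines of the sketch below. The directory layout and helper names are assumptions, not details given in the paper.

```python
# Hypothetical reconstruction of the IFN/ENIT split described above.
# The directory layout (ifnenit/set_a ... set_e) is an assumption.
import os

IFNENIT_ROOT = "ifnenit"

def list_images(set_name):
    """Return the alphabetically sorted .tif file names of one subset."""
    directory = os.path.join(IFNENIT_ROOT, f"set_{set_name}")
    return sorted(f for f in os.listdir(directory) if f.lower().endswith(".tif"))

# Subsets b-e form the training set (25,955 images in total per the paper).
train_files = [os.path.join(f"set_{s}", f)
               for s in ("b", "c", "d", "e")
               for f in list_images(s)]

# Validation: first half of set a in alphabetical order
# (3,269 images, ae07_001.tif to ai54_028.tif per the paper).
set_a = list_images("a")
val_files = set_a[: len(set_a) // 2]
```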