Unbiased Online Recurrent Optimization

Authors: Corentin Tallec, Yann Ollivier

ICLR 2018

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In Section 6, UORO is shown to provide convergence on a set of synthetic experiments where truncated BPTT fails to display reliable convergence. An implementation of UORO is provided as supplementary material. |
| Researcher Affiliation | Academia | Corentin Tallec, Laboratoire de Recherche en Informatique, Université Paris Sud, Gif-sur-Yvette, 91190, France (corentin.tallec@u-psud.fr); Yann Ollivier, Laboratoire de Recherche en Informatique, Université Paris Sud, Gif-sur-Yvette, 91190, France (yann@yann-ollivier.org) |
| Pseudocode | Yes | The resulting algorithm is detailed in Alg. 1. (A minimal sketch of the rank-one update behind Alg. 1 follows the table.) |
| Open Source Code | Yes | An implementation of UORO is provided as supplementary material. |
| Open Datasets | Yes | To monitor the variance of UORO's estimate over time, a 64-unit GRU recurrent network is trained on the first 10⁷ characters of the full works of Shakespeare using UORO. |
| Dataset Splits | No | The paper describes training on sequences and evaluation, but does not specify a distinct validation set with explicit split percentages or counts for hyperparameter tuning. For example: "Optimization was performed using Adam with the default setting β1 = 0.9 and β2 = 0.999, and a decreasing learning rate ηt = γ/(1 + α√t), with t the number of characters processed." |
| Hardware Specification | No | The paper mentions using a "64-unit GRU recurrent network" but does not specify any hardware components such as CPU or GPU models, memory, or the computing environment used for the experiments. |
| Software Dependencies | No | The paper mentions using "Adam with the default setting β1 = 0.9 and β2 = 0.999" and "vanilla SGD", but does not provide version numbers for any software libraries or frameworks (e.g., TensorFlow, PyTorch, Python) that would be needed for replication. |
| Experiment Setup | Yes | Optimization was performed using Adam with the default setting β1 = 0.9 and β2 = 0.999, and a decreasing learning rate ηt = γ/(1 + α√t), with t the number of characters processed. ... (with learning rates using α = 0.015 and γ = 10⁻³). ... The learning rates used α = 0.03 and γ = 10⁻³. (A sketch of this optimizer setup follows the table.) |
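
The rank-one trick behind Alg. 1 is compact enough to sketch. Below is a minimal numpy illustration of one UORO step for a vanilla tanh RNN with a linear readout; the architecture, dimensions, toy task, and all variable names are illustrative assumptions, not the authors' supplementary implementation.

```python
# Minimal numpy sketch of one UORO step (the rank-one trick of Alg. 1) for a
# vanilla tanh RNN  s' = tanh(W s + U x + b)  with linear readout  y = V s'.
# Everything here (sizes, task, names) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n, m, k, eps = 8, 4, 3, 1e-7           # state, input, output sizes (assumed)

# Recurrent parameters theta = (W, U, b); readout V carries the loss's
# direct parameter dependence and gets an exact gradient.
W = rng.normal(0, 0.3, (n, n)); U = rng.normal(0, 0.3, (n, m))
b = np.zeros(n);                V = rng.normal(0, 0.3, (k, n))

s = np.zeros(n)                        # current state s_t
s_tilde = np.zeros(n)                  # rank-one factors: E[s_tilde theta_tilde^T]
theta_tilde = np.zeros(n*n + n*m + n)  #   is an unbiased estimate of ds_t/dtheta

def uoro_step(x, target):
    global s, s_tilde, theta_tilde
    # Forward transition s_{t+1} = F(s_t, x_t, theta).
    s_new = np.tanh(W @ s + U @ x + b)
    dtanh = 1.0 - s_new**2                      # tanh'(pre-activation)

    # Forward directional derivative (dF/ds) s_tilde.
    Js = dtanh * (W @ s_tilde)

    # Random signs nu and the backprop product nu^T (dF/dtheta),
    # flattened in (W, U, b) order.
    nu = rng.choice([-1.0, 1.0], n)
    d = nu * dtanh
    nuFtheta = np.concatenate([np.outer(d, s).ravel(),
                               np.outer(d, x).ravel(), d])

    # Variance-reducing scalings rho0, rho1 from the paper.
    rho0 = np.sqrt(np.linalg.norm(theta_tilde) / (np.linalg.norm(Js) + eps)) + eps
    rho1 = np.sqrt(np.linalg.norm(nuFtheta) / (np.linalg.norm(nu) + eps)) + eps

    # Rank-one update so that s_tilde theta_tilde^T tracks ds_{t+1}/dtheta.
    s_tilde = rho0 * Js + rho1 * nu
    theta_tilde = theta_tilde / rho0 + nuFtheta / rho1

    # Loss at t+1 (squared error here) and its unbiased gradient estimate.
    y = V @ s_new
    dl_ds = V.T @ (y - target)                  # dloss/ds_{t+1}
    g_theta = (dl_ds @ s_tilde) * theta_tilde   # UORO estimate of dloss/dtheta
    g_V = np.outer(y - target, s_new)           # exact direct gradient for V
    s = s_new
    return g_theta, g_V

g_theta, g_V = uoro_step(rng.normal(size=m), np.zeros(k))
```

In expectation over the random signs ν, `s_tilde theta_tilde^T` equals the true Jacobian ds/dθ, which is what makes the resulting gradient estimate unbiased; the scalings ρ0, ρ1 only reduce its variance.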
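The reported optimizer setup is likewise easy to sketch. The snippet below assumes PyTorch (the paper does not name a framework) and shows Adam with β1 = 0.9, β2 = 0.999 driven by the decreasing schedule ηt = γ/(1 + α√t), with the reported γ = 10⁻³ and α = 0.015 (or 0.03); the 64-unit GRU is from the paper, while the input size is a stand-in.

```python
# Sketch of the reported optimizer setup, assuming PyTorch: Adam with
# beta1 = 0.9, beta2 = 0.999 and the decreasing schedule
# eta_t = gamma / (1 + alpha * sqrt(t)), where t counts processed characters.
import math
import torch

model = torch.nn.GRU(input_size=128, hidden_size=64)  # 64-unit GRU; input size assumed
gamma, alpha = 1e-3, 0.015                            # reported values (0.03 in the other run)
opt = torch.optim.Adam(model.parameters(), lr=gamma, betas=(0.9, 0.999))
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda t: 1.0 / (1.0 + alpha * math.sqrt(t)))  # multiplies the base lr gamma

# Training loop (not shown): after each update, call opt.step() then
# sched.step(), stepping the scheduler once per processed character since
# t in the paper counts characters.
```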