Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback

Authors: Hal Daumé III, John Langford, Amr Sharaf

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show the efficacy of RESLOPE on four benchmark reinforcement learning problems and three bandit structured prediction problems (§5.1), comparing to several reinforcement learning algorithms: Reinforce, Proximal Policy Optimization and Advantage Actor-Critic.
Researcher Affiliation | Collaboration | Hal Daumé III (University of Maryland & Microsoft Research NYC, me@hal3.name); John Langford (Microsoft Research NYC, jcl@microsoft.com); Amr Sharaf (University of Maryland, amr@cs.umd.edu)
Pseudocode | Yes | Algorithm 1: RESIDUAL LOSS PREDICTION (RESLOPE) with single deviations (a hedged sketch of this single-deviation loop follows the table).
Open Source Code | Yes | The code is available at https://github.com/hal3/macarico and https://github.com/hal3/reslope.
Open Datasets | Yes | We perform experiments on the three tasks described in detail in Appendix G: English Part of Speech Tagging, English Dependency Parsing and Chinese Part of Speech Tagging. ... English POS Tagging: we conduct POS tagging experiments over the 45 Penn Treebank (Marcus et al., 1993) tags.
Dataset Splits | Yes | We measure performance in terms of average cumulative loss on the online examples as well as on a held-out evaluation dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models) used for the experiments.
Software Dependencies | No | We implement our models on top of the DyNet neural network optimization package (Neubig et al., 2017). ... We optimize all parameters of the model using the Adam optimizer (Kingma & Ba, 2014)... The paper mentions software packages like DyNet and Adam but does not provide specific version numbers for them.
Experiment Setup | Yes | We optimize all parameters of the model using the Adam optimizer (Kingma & Ba, 2014), with a tuned learning rate, a moving-average rate for the mean of β1 = 0.9 and for the variance of β2 = 0.999; epsilon (for numerical stability) is fixed at 1e-8 (these are the DyNet defaults). The learning rate is tuned in the range {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. For the structured prediction experiments, the following input-feature hyperparameters are tuned: word embedding dimension {50, 100, 200, 300}, BiLSTM dimension {50, 150, 300}, number of BiLSTM layers {1, 2}, policy RNN dimension {50, 150, 300}, number of policy layers {1, 2}, roll-out probability β {0.0, 0.5, 1.0}. (This tuning grid is sketched in code after the table.)
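
For reference alongside the Pseudocode row: below is a minimal Python sketch of the single-deviation loop that Algorithm 1 (RESLOPE) names. It is an illustration under stated assumptions, not the authors' implementation from the macarico/reslope repositories; every argument name (env_reset, explore, cb_update, ...) is a hypothetical stand-in supplied by the caller.

import random


def reslope_episode(env_reset, env_step, env_loss, horizon,
                    greedy, explore, predict, cb_update):
    # One episode of single-deviation residual loss prediction (a sketch).
    # Hypothetical interfaces assumed here:
    #   env_reset()        -> initial state
    #   env_step(s, a)     -> next state
    #   env_loss()         -> total episodic loss (the only feedback observed)
    #   greedy(s)          -> action chosen by the current learned policy
    #   explore(s)         -> (action, prob) from a contextual-bandit explorer
    #   predict(s, a)      -> current predicted per-step loss for (s, a)
    #   cb_update(s, a, cost, prob) -> bandit update of the loss predictor
    t_dev = random.randrange(horizon)   # deviate at one uniformly random step
    state = env_reset()
    off_deviation = 0.0                 # sum of predicted losses at other steps
    deviation = None                    # (state, action, prob) at the deviation

    for t in range(horizon):
        if t == t_dev:
            action, prob = explore(state)
            deviation = (state, action, prob)
        else:
            action = greedy(state)
            off_deviation += predict(state, action)
        state = env_step(state, action)

    # The residual is the observed episodic loss minus the predicted losses
    # at the non-deviation steps; that scalar is the bandit cost charged to
    # the deviating action.
    residual = env_loss() - off_deviation
    s, a, p = deviation
    cb_update(s, a, residual, p)

The sketch deliberately leaves the exploration strategy and the underlying regressor abstract; in the paper these roles are filled by a contextual bandit learner.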
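
And for the Experiment Setup row, a small sketch of the reported tuning grid. The values are copied from the quoted text; the dictionary keys and the helper function are illustrative names, not identifiers from the released code.

import itertools

# Adam settings quoted above (DyNet defaults apart from the tuned rate).
ADAM_BETA_1, ADAM_BETA_2, ADAM_EPS = 0.9, 0.999, 1e-8

# Search space reported for the structured prediction experiments.
GRID = {
    "learning_rate":  [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],
    "word_embed_dim": [50, 100, 200, 300],
    "bilstm_dim":     [50, 150, 300],
    "bilstm_layers":  [1, 2],
    "policy_rnn_dim": [50, 150, 300],
    "policy_layers":  [1, 2],
    "rollout_beta":   [0.0, 0.5, 1.0],
}

def configurations(grid):
    # Enumerate every combination in the grid (2,592 settings for GRID above).
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

Enumerating the full grid makes the scale of the reported sweep concrete; the paper itself does not say whether tuning was exhaustive or sampled.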