Residual Loss Prediction: Reinforcement Learning With No Incremental Feedback
Authors: Hal Daumé III, John Langford, Amr Sharaf
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we show the efficacy of RESLOPE on four benchmark reinforcement problems and three bandit structured prediction problems (5.1), comparing to several reinforcement learning algorithms: Reinforce, Proximal Policy Optimization and Advantage Actor-Critic. |
| Researcher Affiliation | Collaboration | Hal Daumé III, University of Maryland & Microsoft Research NYC, me@hal3.name; John Langford, Microsoft Research NYC, jcl@microsoft.com; Amr Sharaf, University of Maryland, amr@cs.umd.edu |
| Pseudocode | Yes | Algorithm 1 RESIDUAL LOSS PREDICTION (RESLOPE) with single deviations |
| Open Source Code | Yes | The code is available at https://github.com/hal3/macarico, https://github.com/hal3/reslope |
| Open Datasets | Yes | We perform experiments on the three tasks described in detail in Appendix G: English Part of Speech Tagging, English Dependency Parsing and Chinese Part of Speech Tagging. ... English POS Tagging we conduct POS tagging experiments over the 45 Penn Treebank (Marcus et al., 1993) tags. |
| Dataset Splits | Yes | We measure performance in terms of average cumulative loss on the online examples as well as on a held-out evaluation dataset. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models) used for the experiments. |
| Software Dependencies | No | We implement our models on top of the DyNet neural network optimization package (Neubig et al., 2017). ... We optimize all parameters of the model using the Adam optimizer (Kingma & Ba, 2014)... The paper mentions software packages like DyNet and Adam but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We optimize all parameters of the model using the Adam optimizer (Kingma & Ba, 2014), with a tuned learning rate, a moving-average rate for the mean of β1 = 0.9 and for the variance of β2 = 0.999; epsilon (for numerical stability) is fixed at 1e-8 (these are the DyNet defaults). The learning rate is tuned in the range {0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}. For the structured prediction experiments, the following input-feature hyperparameters are tuned: word embedding dimension {50, 100, 200, 300}, Bi-LSTM dimension {50, 150, 300}, number of Bi-LSTM layers {1, 2}, policy RNN dimension {50, 150, 300}, number of policy layers {1, 2}, roll-out probability β {0.0, 0.5, 1.0}. |
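The experiment-setup row above describes a hyperparameter search space and fixed Adam settings. A minimal sketch of that search space is shown below; the dictionary keys and helper names are illustrative (the paper's actual tuning code in macarico/reslope may be organized differently), but the values are taken from the quoted setup.

```python
from itertools import product

# Hyperparameter grid as quoted in the paper's experiment setup.
# Key names here are illustrative, not the paper's actual identifiers.
GRID = {
    "learning_rate": [0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001],
    "word_embedding_dim": [50, 100, 200, 300],
    "bilstm_dim": [50, 150, 300],
    "bilstm_layers": [1, 2],
    "policy_rnn_dim": [50, 150, 300],
    "policy_layers": [1, 2],
    "rollout_beta": [0.0, 0.5, 1.0],
}

# Adam settings fixed at the DyNet defaults quoted above.
ADAM = {"beta_1": 0.9, "beta_2": 0.999, "eps": 1e-8}

def grid_configs(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# 6 * 4 * 3 * 2 * 3 * 2 * 3 = 2592 combinations in an exhaustive sweep
n_configs = sum(1 for _ in grid_configs(GRID))
```

Note that the paper does not state whether the search was exhaustive; enumerating the full grid here simply makes the size of the stated search space concrete.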