Cold-Start Reinforcement Learning with Softmax Policy Gradient
Authors: Nan Ding, Radu Soricut
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evidence validates this method on automatic summarization and image captioning tasks. We numerically evaluate our method on two sequence generation benchmarks, a headline-generation task and an image-caption generation task (Section 5). |
| Researcher Affiliation | Industry | Nan Ding Google Inc. Venice, CA 90291 dingnan@google.com Radu Soricut Google Inc. Venice, CA 90291 rsoricut@google.com |
| Pseudocode | Yes | The details about the gradient evaluation for the bang-bang rewarded softmax value function are described in Algorithm 1 of the Supplementary Material. |
| Open Source Code | No | The paper does not provide an unambiguous statement about releasing its source code or a direct link to a repository for the described methodology. |
| Open Datasets | Yes | In our experiments, the supervised data comes from the English Gigaword [9], and consists of news-articles paired with their headlines. For the image-captioning task, we use the standard MSCOCO dataset [14]. |
| Dataset Splits | Yes | We use a training set of about 6 million article-headline pairs, in addition to two randomly-extracted validation and evaluation sets of 10K examples each. We combine the training and validation datasets for training our model, and hold out a subset of 4K images as our validation set. |
| Hardware Specification | No | The paper mentions '40 workers' and '10 parameter servers' for computing updates but does not specify any concrete hardware details such as GPU models, CPU types, or memory amounts. |
| Software Dependencies | Yes | We implemented all the algorithms using TensorFlow 1.0 [6]. |
| Experiment Setup | Yes | The model is optimized using ADAGRAD with a mini-batch size of 200, a learning rate of 0.01, and gradient clipping with norm equal to 4. We run the training procedure for 10M steps... |
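
For orientation, below is a minimal sketch of the optimizer configuration quoted in the Experiment Setup row, written against the TensorFlow 1.x API named in the Software Dependencies row. It is not the authors' code: the `build_train_op` helper and its `loss` argument are illustrative placeholders, and only the hyperparameter values (ADAGRAD, learning rate 0.01, gradient clipping with norm 4) come from the quoted text; the mini-batch size of 200 and the 10M-step budget would be handled by the input pipeline and training loop, which are omitted here.

```python
# Hedged sketch of the reported optimizer setup (not the authors' implementation).
# Assumes TensorFlow 1.x, matching the "TensorFlow 1.0" dependency quoted above.
import tensorflow as tf

def build_train_op(loss, learning_rate=0.01, clip_norm=4.0):
    """ADAGRAD with gradient clipping by global norm, per the quoted setup."""
    optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
    # Compute per-variable gradients of the training loss.
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)
    # Clip the gradients so their global norm does not exceed clip_norm (4).
    clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm)
    # Apply the clipped gradients as a single training step.
    return optimizer.apply_gradients(zip(clipped_grads, variables))
```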