Cold-Start Reinforcement Learning with Softmax Policy Gradient

Authors: Nan Ding, Radu Soricut

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evidence validates the method on automatic summarization and image captioning tasks: "We numerically evaluate our method on two sequence generation benchmarks, a headline-generation task and an image-caption generation task" (Section 5).
Researcher Affiliation | Industry | Nan Ding, Google Inc., Venice, CA 90291, dingnan@google.com; Radu Soricut, Google Inc., Venice, CA 90291, rsoricut@google.com
Pseudocode | Yes | "The details about the gradient evaluation for the bang-bang rewarded softmax value function are described in Algorithm 1 of the Supplementary Material." A hedged sketch of this gradient computation appears after the table.
Open Source Code | No | The paper makes no unambiguous statement about releasing its source code and provides no direct link to a repository for the described methodology.
Open Datasets | Yes | "In our experiments, the supervised data comes from the English Gigaword [9], and consists of news-articles paired with their headlines." For the image-captioning task: "we use the standard MSCOCO dataset [14]."
Dataset Splits | Yes | Headline generation: "We use a training set of about 6 million article-headline pairs, in addition to two randomly-extracted validation and evaluation sets of 10K examples each." Image captioning: "We combine the training and validation datasets for training our model, and hold out a subset of 4K images as our validation set."
Hardware Specification | No | The paper mentions "40 workers" and "10 parameter servers" for computing updates but does not specify concrete hardware details such as GPU models, CPU types, or memory amounts.
Software Dependencies | Yes | "We implemented all the algorithms using TensorFlow 1.0 [6]."
Experiment Setup | Yes | "The model is optimized using ADAGRAD with a mini-batch size of 200, a learning rate of 0.01, and gradient clipping with norm equal to 4. We run the training procedure for 10M steps..." A configuration sketch mapping these settings onto TensorFlow 1.0 appears after the table.
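
For readers without the Supplementary Material at hand, the following is a minimal sketch of the kind of gradient computation the Pseudocode row refers to, not the authors' Algorithm 1. It assumes a softmax value function of the form V(theta) = log sum_z p_theta(z) * exp(R(z)), evaluated over an explicit finite set of candidate sequences; the function name and array shapes are illustrative.

    import numpy as np

    def softmax_value_gradient(log_probs, grad_log_probs, rewards):
        # Softmax value function: V(theta) = log sum_z p_theta(z) * exp(R(z)),
        # with the sum taken over an explicit candidate set (not i.i.d. samples).
        # Its gradient is sum_z q(z) * grad log p_theta(z), where the
        # reweighted distribution satisfies q(z) proportional to
        # p_theta(z) * exp(R(z)).
        #
        # log_probs:      (N,) log p_theta(z_i) for N candidate sequences
        # grad_log_probs: (N, D) gradient of log p_theta(z_i) w.r.t. theta
        # rewards:        (N,) R(z_i); a bang-bang scheme makes each entry
        #                 either 0 or a fixed positive constant
        logits = log_probs + rewards
        logits = logits - logits.max()   # stabilize before exponentiating
        q = np.exp(logits)
        q = q / q.sum()                  # normalize the reweighted weights
        return q @ grad_log_probs        # (D,) gradient estimate

Under a bang-bang reward, exp(R(z)) is either 1 or a large constant, so q concentrates probability mass on the rewarded candidates.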
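
The Experiment Setup row maps naturally onto TensorFlow 1.0 API calls. The sketch below is an assumption about how those settings could be wired up, with a stand-in loss, since the paper's training code is not available; only the optimizer, learning rate, and clipping norm come from the quoted text.

    import tensorflow as tf  # TensorFlow 1.0, per the Software Dependencies row

    # Stand-in trainable parameters and loss; the paper's real objective is
    # its sequence-generation loss, which is not reproduced here.
    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
    loss = tf.reduce_sum(tf.square(w - 1.0))

    # ADAGRAD with learning rate 0.01, as quoted in the setup row.
    optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)
    grads_and_vars = optimizer.compute_gradients(loss)
    grads, variables = zip(*grads_and_vars)

    # Gradient clipping with norm equal to 4 (read here as a global norm).
    clipped, _ = tf.clip_by_global_norm(grads, clip_norm=4.0)
    train_op = optimizer.apply_gradients(list(zip(clipped, variables)))

The mini-batch size of 200 and the 10M steps would be properties of the input pipeline and the training loop (e.g., running sess.run(train_op) ten million times), and the quoted 40 workers and 10 parameter servers would distribute these updates.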