Sequence Prediction with Unlabeled Data by Reward Function Learning
Authors: Lijun Wu, Li Zhao, Tao Qin, Jianhuang Lai, Tie-Yan Liu
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization. |
| Researcher Affiliation | Collaboration | Lijun Wu (1), Li Zhao (2), Tao Qin (2), Jianhuang Lai (1,3) and Tie-Yan Liu (2). (1) School of Data and Computer Science, Sun Yat-sen University; (2) Microsoft Research Asia; (3) Guangdong Key Laboratory of Information Security Technology |
| Pseudocode | Yes | Algorithm 1: REINFORCE Training for Sequence Prediction with Unlabeled Data; Algorithm 2: Complete Algorithm for Sequence Prediction with Unlabeled Data |
| Open Source Code | No | The paper does not provide a link or explicit statement about the availability of their *own* source code for the methodology described. It only refers to the open-source code of a baseline model. |
| Open Datasets | Yes | For the machine translation experiment, we use data from the German-English machine translation track of the IWSLT 2014 evaluation campaign [Cettolo et al., 2014]... The data set we use to train and evaluate our model on text summarization is from a subset of Gigaword Corpus [Graff and Cieri, 2003]. |
| Dataset Splits | Yes | The data consists of training/dev/test corpus with 153326, 6969 and 6750 sentence pairs respectively. ... The number of sentence pairs in the training set, validation set and test set are 189295, 18475 and 10000 respectively |
| Hardware Specification | No | No specific hardware details (e.g., CPU/GPU models, memory specifications, or cloud instance types) used for running the experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions 'Blocks [Van Merriënboer et al., 2015]' and 'Torch [Collobert et al., 2011]' but does not provide specific version numbers for these or any other software components used. |
| Experiment Setup | Yes | Our policy is a GRU with 256 hidden units... The reward network is similar to the policy function, except the concatenation of decoder hidden state and context vector is fed into another multilayer perceptron (MLP) with 256 hidden units... Hyper-parameter α is 0.5, and the delay constant γ is 0.1. |
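
The Pseudocode and Experiment Setup rows above describe a 256-unit GRU policy, a reward network built as an MLP over the concatenated decoder hidden state and attention context vector, and REINFORCE training driven by that pseudo reward on unlabeled data. The sketch below is a minimal, hedged PyTorch illustration of those two pieces; the class and function names, the tanh activation, and the baseline handling are assumptions for illustration, not the authors' code (the paper releases none).

```python
# Minimal sketch, assuming a PyTorch implementation. Hidden sizes and the
# alpha/gamma values are the ones quoted in the Experiment Setup row; their
# exact roles follow the paper's Algorithms 1-2 and are only hinted at here.

import torch
import torch.nn as nn

HIDDEN = 256   # GRU / MLP hidden size quoted in the Experiment Setup row
ALPHA = 0.5    # quoted trade-off hyper-parameter (role per the paper)
GAMMA = 0.1    # quoted "delay constant" (not used in this sketch)


class PseudoRewardNet(nn.Module):
    """Scores a decoding step from [decoder hidden state; context vector]."""

    def __init__(self, hidden: int = HIDDEN, ctx: int = HIDDEN):
        super().__init__()
        # Concatenated (hidden, context) -> 256-unit MLP -> scalar reward.
        self.mlp = nn.Sequential(
            nn.Linear(hidden + ctx, HIDDEN),
            nn.Tanh(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, dec_hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # dec_hidden, context: (batch, hidden) tensors for one decoding step.
        return self.mlp(torch.cat([dec_hidden, context], dim=-1)).squeeze(-1)


def reinforce_loss(log_probs: torch.Tensor,
                   pseudo_rewards: torch.Tensor,
                   baseline: torch.Tensor) -> torch.Tensor:
    """One REINFORCE policy-gradient loss on unlabeled source sentences.

    log_probs:      (batch, T) log pi(y_t | y_<t, x) of the sampled tokens
    pseudo_rewards: (batch,)   sequence-level score from PseudoRewardNet
    baseline:       (batch,)   variance-reduction baseline (e.g. running mean)
    """
    advantage = (pseudo_rewards - baseline).detach()
    # Negative sign: minimizing this loss maximizes expected pseudo reward.
    return -(advantage.unsqueeze(1) * log_probs).sum(dim=1).mean()
```

On labeled data the model would still be trained with the usual supervised objective; presumably α = 0.5 weights the unlabeled (pseudo-reward) term against it and γ = 0.1 governs when the unlabeled data is mixed in, but those details should be taken from the paper's Algorithm 2 rather than from this sketch.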