Learning Sentence Representation with Guidance of Human Attention
Authors: Shaonan Wang, Jiajun Zhang, Chengqing Zong
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiments and Results |
| Researcher Affiliation | Academia | National Laboratory of Pattern Recognition, CASIA, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China |
| Pseudocode | No | The paper describes models using mathematical equations but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The code for training and evaluation will be released. |
| Open Datasets | Yes | The SCBOW model uses the Toronto Book Corpus, which contains 7,087 books collected from the web (the corpus can be downloaded from http://www.cs.toronto.edu/~mbweb/). ... Then we train the model using AdaDelta for one epoch with an initial learning rate of 0.001 and a batch size of 100. For the PP model, phrases in the training dataset are not sentences or even constituents, causing worse tagging results. Hence we use the SICK dataset, which consists of 10,000 English sentence pairs with human annotation, to train the attention models after training the PP model. The attention models are trained with AdaGrad for ten epochs with an initial learning rate of 0.05. ... the pre-trained word embeddings (available at https://github.com/mmihaltz/word2vec-GoogleNews-vectors); the trained PP model is available at http://ttic.uchicago.edu/~wieting/ |
| Dataset Splits | No | The paper describes training and test data, but does not explicitly define a validation dataset split with percentages or counts. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software tools such as the Stanford POS tagger and the C&C tool, and optimizers such as AdaDelta and AdaGrad, but does not specify their version numbers. |
| Experiment Setup | Yes | In all the models, we randomly initialize the POS tag and CCG supertag vectors with 300-dimensional vectors, drawn from a normal distribution with µ = 0.0 and σ = 0.01. In the ATT-SUR model, surprisal is calculated by a state-of-the-art large-scale neural language model released by [Jozefowicz et al., 2016]. We also train a 5th-order n-gram language model with modified Kneser-Ney smoothing, but its performance is slightly worse, so we only report results for the neural language model. In the experiments, we clip the surprisal value x to min(max(0, x), 10). In the SCBOW model, we use two negative examples and initialize the embedding layer with the pre-trained word embeddings. We then train the model using AdaDelta for one epoch with an initial learning rate of 0.001 and a batch size of 100. For the PP model, phrases in the training dataset are not sentences or even constituents, causing worse tagging results. Hence we use the SICK dataset, which consists of 10,000 English sentence pairs with human annotation, to train the attention models after training the PP model. The attention models are trained with AdaGrad for ten epochs with an initial learning rate of 0.05. |
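The two concrete numerical details quoted in the Experiment Setup row (tag vectors drawn from N(0.0, 0.01) in 300 dimensions, and surprisal values clipped to min(max(0, x), 10)) can be sketched as follows. This is a minimal illustration assuming NumPy; the tag count and all names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialize POS-tag / CCG-supertag vectors as 300-dimensional
# draws from a normal distribution with mu = 0.0 and sigma = 0.01,
# matching the setup described in the paper.
NUM_TAGS = 45  # illustrative value (e.g. Penn Treebank POS tag set size)
DIM = 300
tag_vectors = rng.normal(loc=0.0, scale=0.01, size=(NUM_TAGS, DIM))

def clip_surprisal(x: float, lo: float = 0.0, hi: float = 10.0) -> float:
    """Clamp a surprisal value to [0, 10], i.e. min(max(0, x), 10)."""
    return min(max(lo, x), hi)

print(clip_surprisal(-3.2))  # 0.0 (negative values floored)
print(clip_surprisal(4.7))   # 4.7 (in-range values unchanged)
print(clip_surprisal(27.5))  # 10.0 (large values capped)
```

The clipping bounds extreme surprisal estimates from the language model so that a single rare word cannot dominate the attention weights.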