Convolutional Neural Networks for Text Hashing

Authors: Jiaming Xu, Peng Wang, Guanhua Tian, Bo Xu, Jun Zhao, Fangyuan Wang, Hongwei Hao

IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show the superiority of our proposed approach over several state-of-the-art hashing methods when tested on one short text dataset as well as one normal text dataset.
Researcher Affiliation | Academia | Institute of Automation, Chinese Academy of Sciences, 100190, Beijing, P.R. China; National Laboratory of Pattern Recognition (NLPR), Beijing, P.R. China. {jiaming.xu, peng.wang, guanhua.tian, boxu, fangyuan.wang}@ia.ac.cn, jzhao@nlpr.ia.ac.cn, hongwei.hao@ia.ac.cn
Pseudocode | No | The paper includes mathematical equations and descriptions of the algorithm steps, but does not provide a formal pseudocode or algorithm block.
Open Source Code | No | The paper neither provides a link to its own source code nor states that the code for its method is open-sourced or otherwise available.
Open Datasets | Yes | We test our algorithms on two public text datasets... Search Snippets. This dataset was selected from the results of web search transactions using predefined phrases of 8 different domains [Phan et al., 2008] (http://jwebpro.sourceforge.net/data-web-snippets.tar.gz). ... 20Newsgroups. We select the popular bydate version and use the stemmed version pre-processed by Ana Cardoso Cachopo [2007] (http://web.ist.utl.pt/acardoso/datasets/). ... By default, our experiments utilize the GloVe embeddings (http://nlp.stanford.edu/projects/glove/) trained by Pennington et al. [2014] on 6 billion tokens of Wikipedia 2014 and Gigaword 5. We also give some comparisons with other word embeddings, such as Senna embeddings [Collobert et al., 2011] (http://ml.nec-labs.com/senna/). (See the GloVe loading sketch after this table.)
Dataset Splits | Yes | For these datasets, we denote the category labels as tags, generate vocabulary from the training sets and randomly select 10% of the training data as the development set. ... Dataset statistics, with C the number of classes, L the document length in words (mean/max) and |V| the vocabulary size: Snippets: C = 8, Train/Test = 10060/2280, L = 17.3/38, |V| = 26265. 20News: C = 20, Train/Test = 10443/6973, L = 92.8/300, |V| = 41877. (See the development-split sketch after this table.)
Hardware Specification | No | The paper does not specify the hardware used for the experiments. There are no mentions of specific GPU models, CPU models, or cloud computing instance types with specifications.
Software Dependencies | No | The paper mentions using LDA for the ITQ baseline (without a version number) and various word embeddings, but it does not list specific software dependencies with version numbers (e.g., Python X.Y, TensorFlow A.B.C).
Experiment Setup | Yes | The parameter k in Equation 2 is fixed to 7 when constructing the graph Laplacians in our approach, as well as in the baseline methods STH, STH-RBF and STHs. We set the width of the convolutional filter w to 3, the size of the feature map n1 to 80, the value of K in the max pooling layer to 2, the dimension of word embeddings dw to 50, the dimension of position embeddings dp to 8 and the learning rate λ to 0.01. Moreover, the feature weight α at the output layer is tuned through the grid from 0.001 to 1024. The optimal weights are α = 16 on Search Snippets and α = 128 on 20Newsgroups. (See the graph-Laplacian and K-max pooling sketches after this table.)
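
To ground the Open Datasets row: the reported setup uses the public 50-dimensional GloVe 6B vectors. Below is a minimal loading sketch; the file name glove.6B.50d.txt and the space-separated word/vector line layout come from the public GloVe release, while the helper itself is illustrative rather than from the paper.

import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Load GloVe vectors from the public plain-text release.

    Each line is: word v1 v2 ... v50 (space-separated).
    Returns a dict mapping word -> 50-d numpy vector.
    """
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# Usage: embeddings = load_glove(); embeddings["computer"].shape -> (50,)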
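
The Dataset Splits row states that 10% of the training data is randomly held out as a development set. A minimal sketch of such a split follows, assuming a simple uniform random permutation; the paper does not describe the sampling procedure beyond "randomly select", and the seed is illustrative.

import numpy as np

def train_dev_split(examples, dev_fraction=0.10, seed=0):
    """Randomly hold out a fraction of the training data as a
    development set, mirroring the paper's 10% selection."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(examples))
    n_dev = int(len(examples) * dev_fraction)
    dev = [examples[i] for i in idx[:n_dev]]
    train = [examples[i] for i in idx[n_dev:]]
    return train, dev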
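
The Experiment Setup row fixes k = 7 for the graph Laplacians of Equation 2. Since the extraction does not reproduce Equation 2 itself, the sketch below shows one conventional construction: an unnormalised Laplacian L = D - W over a k-nearest-neighbour graph. The cosine-similarity edge weights and the symmetrisation step are assumptions, not details taken from the paper.

import numpy as np

def knn_graph_laplacian(X, k=7):
    """Unnormalised graph Laplacian L = D - W over a k-NN graph.

    X is an (n, d) matrix of document vectors (e.g. TF-IDF);
    k = 7 matches the paper's setting for Equation 2.
    """
    # Cosine similarities between all pairs of documents.
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # never choose self as a neighbour
    n = X.shape[0]
    W = np.zeros((n, n))
    # Keep each document's k most similar neighbours as edges.
    nbrs = np.argpartition(-S, k, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W[rows, nbrs.ravel()] = S[rows, nbrs.ravel()]
    W = np.maximum(W, W.T)        # make the graph undirected
    D = np.diag(W.sum(axis=1))
    return D - W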
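
Likewise, the K-max pooling layer with K = 2 keeps the two largest activations of each convolutional feature map. The sketch below follows the usual definition of K-max pooling, which preserves the original order of the selected activations; the paper's exact variant may differ.

import numpy as np

def k_max_pooling(feature_map, K=2):
    """K-max pooling: keep the K largest activations of each feature
    map, preserving their original order (K = 2 in the paper).

    feature_map: (n_maps, length) array of convolution outputs.
    Returns an (n_maps, K) array.
    """
    # Indices of the K largest values per row...
    top = np.argpartition(-feature_map, K - 1, axis=1)[:, :K]
    top.sort(axis=1)  # ...restored to their original sequence order
    return np.take_along_axis(feature_map, top, axis=1)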