Self-paced Compensatory Deep Boltzmann Machine for Semi-Structured Document Embedding

Authors: Shuangyin Li, Rong Pan, Jun Yan

IJCAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the experiment section, we will present the experimental results on two corpora, Wikipedia and IMDB, to show the performance of the proposed model on semi-structured document embedding for document classification, retrieval, and tag prediction when comparing with state-of-the-art baselines.
Researcher Affiliation | Collaboration | Shuangyin Li, iPIN, Shenzhen, China (shuangyinli@ipin.com); Rong Pan, School of Data and Computer Science, Sun Yat-sen University, China (panr@sysu.edu.cn); Jun Yan, Microsoft Research Asia (junyan@microsoft.com).
Pseudocode | No | The paper describes procedures such as the 'Contrastive Divergence (CD) algorithm' and 'alternating Gibbs sampling', but it does not present them in structured pseudocode or algorithm blocks.
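The Contrastive Divergence and alternating Gibbs sampling procedures mentioned in this row are standard RBM training steps. A minimal CD-1 update for a binary RBM can be sketched as follows; the function names, learning rate, and in-place update style are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_vis, b_hid, v0, rng, lr=0.01):
    """One Contrastive Divergence (CD-1) update for a binary RBM.

    W: (n_vis, n_hid) weights; b_vis, b_hid: biases; v0: (batch, n_vis) data.
    Returns the updated parameters (modified in place).
    """
    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one step of alternating Gibbs sampling.
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Gradient approximation: <v h>_data - <v h>_model, averaged over the batch.
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid
```

Running more Gibbs steps before the negative-phase statistics (CD-k) tightens the gradient estimate at extra cost; CD-1 is the common default.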
Open Source Code | No | The paper does not contain any explicit statement about releasing source code, nor does it provide a link to a code repository.
Open Datasets | No | The paper states 'The first dataset is from Wikipedia' and 'The second corpus is the data from Internet Movie Database (IMDB)', but it does not provide a specific link, DOI, repository name, or a formal citation with authors/year for the exact versions of these datasets used, as required for concrete access information.
Dataset Splits | Yes | 'Figure 2: Classification results on the Wikipedia (a) and IMDB (b) for RSM, LDA, TWTM and SCDBM with 5-fold cross-validation.' and 'The corpus was randomly divided into two parts: 80% as the database documents and 20% as the query documents.'
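The two protocols quoted above (5-fold cross-validation for classification, and an 80%/20% database/query split for retrieval) can be sketched as below; the function names and the seed are illustrative assumptions:

```python
import numpy as np

def database_query_split(n_docs, rng, query_frac=0.2):
    """Randomly split document indices into database (80%) and query (20%) sets."""
    idx = rng.permutation(n_docs)
    n_query = int(n_docs * query_frac)
    return idx[n_query:], idx[:n_query]  # (database, query)

def kfold_indices(n_docs, k=5, seed=0):
    """Yield (train, test) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_docs)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Each document appears in exactly one test fold across the k iterations, so every classification result is an out-of-fold prediction.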
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software components and models such as 'LDA', 'RSM', 'TWTM', and 'LIBSVM' but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | For the RSM model, we subdivided the datasets into minibatches, each containing 100 training cases, and updated the parameters after each minibatch to speed up learning, in the same way as RSM. Moreover, we chose the 2,500 most frequent words in the training dataset as the feature representations. With the three models, we embedded the documents into 300-dimensional latent representations. When training the proposed SCDBM, we first embedded the metadata of the two corpora into low dimensions using RBMs. For Wikipedia, we let Lem(1) = 300 and Lem(2) = 300. For IMDB, we let Lem(1) = 300, Lem(2) = 300, and Lem(3) = 6. To model the word count vector, we trained LDAs with 300 topics on the two corpora and treated the latent topic distributions as ew at the bottom level of our model. The dropout rate is set to 0.5, as described in [Srivastava et al., 2014], when training our model. Since the number of types of metadata in each corpus is small, the starting value of v becomes critical. Thus, after trying several methods to initialize v, we used a logarithmic scheme to initialize v, as shown in [Jiang et al., 2015], to obtain the best results.
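The preprocessing side of this setup (2,500-word vocabulary cut-off, minibatches of 100 training cases, dropout at rate 0.5) can be sketched as follows. All function names here are assumptions made for illustration; the self-paced logarithmic initialization of v from [Jiang et al., 2015] is deliberately not reimplemented, since the paper does not spell out its form:

```python
import numpy as np
from collections import Counter

def build_vocab(tokenized_docs, vocab_size=2500):
    """Keep the vocab_size most frequent words as the feature set."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    return {w: i for i, (w, _) in enumerate(counts.most_common(vocab_size))}

def doc_to_count_vector(doc, vocab):
    """Map a tokenized document to a word-count vector over the vocabulary."""
    v = np.zeros(len(vocab))
    for w in doc:
        if w in vocab:
            v[vocab[w]] += 1
    return v

def minibatch_indices(n_docs, batch_size=100, seed=0):
    """Yield shuffled minibatches of batch_size training cases."""
    idx = np.random.default_rng(seed).permutation(n_docs)
    for start in range(0, n_docs, batch_size):
        yield idx[start:start + batch_size]

def dropout_mask(shape, rate=0.5, rng=None):
    """Binary dropout mask with keep probability 1 - rate [Srivastava et al., 2014]."""
    if rng is None:
        rng = np.random.default_rng()
    return (rng.random(shape) >= rate).astype(float)
```

Parameter updates after every minibatch of 100 cases trade gradient noise for much faster convergence than full-batch training, which matches the paper's stated motivation of speeding up learning.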