Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Information-Theoretic Multi-view Domain Adaptation: A Theoretical and Empirical Study
Authors: P. Yang, W. Gao
JAIR 2014 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically evaluate the IMAM algorithm for the cross-domain document classification tasks in comparison with the state-of-the-art baselines. |
| Researcher Affiliation | Academia | Pei Yang EMAIL South China University of Technology Guangzhou, China Wei Gao EMAIL Qatar Computing Research Institute Qatar Foundation, Doha, Qatar |
| Pseudocode | Yes | Algorithm 1: Algorithm for IMAM. Input: Document-term matrices D_S^W and D_T^W; document-link matrices D_S^L and D_T^L; class label c ∈ C assigned to each doc d ∈ D_S; # of document clusters (i.e., # of classes). Output: Class label assigned to each document d ∈ D_T. 1: Set t = 0. Initialize document clustering C_D^(0) using NBC. Initialize word clustering C_W^(0) and link clustering C_L^(0) randomly; 2: Initialize distributions q^(0)(w\|d̂), q^(0)(l\|d̂), q^(0)(d\|ŵ), q^(0)(d\|l̂), q^(0)(c\|ŵ), q^(0)(c\|l̂); 3: repeat 4: Document clustering: for each d, find its new cluster index using Eq. 4; 5: Keep q^(t+1)(c\|ŵ) = q^(t)(c\|ŵ) and q^(t+1)(c\|l̂) = q^(t)(c\|l̂); update q^(t+1)(w\|d̂), q^(t+1)(l\|d̂), q^(t+1)(d\|ŵ), q^(t+1)(d\|l̂); 6: Word clustering: for each word w, find its new cluster index using Eq. 5; link clustering: for each link l, find its new cluster index using Eq. 6; 7: Update q^(t+2)(w\|d̂), q^(t+2)(l\|d̂), q^(t+2)(d\|ŵ), q^(t+2)(d\|l̂), q^(t+2)(c\|ŵ) and q^(t+2)(c\|l̂); 8: t = t + 2; 9: until no document's cluster index needs to be adjusted; 10: for each unlabeled d ∈ D_T do 11: assign d the class label based on Eq. 7; 12: end for |
| Open Source Code | No | The paper references source code for third-party tools used for comparison (TSVM at "http://svmlight.joachims.org/" and CODA at "http://www1.cse.wustl.edu/~mchen/code/coda.tar") but does not provide a statement or link for the authors' own implementation of IMAM. |
| Open Datasets | Yes | Cora (McCallum, Nigam, Rennie, & Seymore, 2000) is an online archive which contains approximately 37,000 computer science research papers and over 1 million links among documents. [...] Reuters-21578 (Lewis, 2004) is widely used for the evaluation of automatic text categorization algorithms. The Reuters-21578 corpus also has a hierarchical structure, which contains 5 top categories. We used the pre-processed version of the corpus that is publicly accessible3. (Footnote 3: http://www.cse.ust.hk/TL/dataset/Reuters.zip.) |
| Dataset Splits | Yes | Based on this dataset, we used a similar way as Dai et al. (2007a) to construct our training and test sets. For each set, we chose two top categories, one as positive class and the other as the negative. Different sub-categories were deemed as different domains. The task is defined as top category classification. For example, the subset denoted as DA-EC consists of source domain: DA 1(+), EC 1(-); and target domain: DA 2(+), EC 2(-). [...] For each algorithm, the parameters were tuned by using five-fold cross-validation on training data. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions techniques like TF-IDF and pLSA, and tools like TSVM and CODA, but does not provide specific version numbers for any software libraries or dependencies used in their own implementation. |
| Experiment Setup | Yes | Figure 2 shows the error rate curves varying with different number of word (and link) clusters on the 4 subsets: DA-EC, DA-NT, DA-OS and EC-NT. The X-axis represents the number of word (and link) clusters which is tuned from 32 to 512. According to the performance shown in the figure, we empirically set the number of word (and link) clusters to 128. [...] Figure 3 shows that the performance curves vary with different values of α. [...] in the remaining experiments, we set the value of α to 0.7. [...] We empirically set λ to 0.5 after trying 0, 0.25, 0.5, 1, 2 and 4. |
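The Algorithm 1 pseudocode quoted in the table alternates document, word, and link clustering until no document changes its cluster index. As a rough, hypothetical illustration of that alternating information-theoretic clustering loop, the sketch below runs a single-view analogue (documents only, with a KL-divergence assignment rule standing in for the paper's Eqs. 4-6). It is not the authors' IMAM implementation; function names and the toy data are inventions for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two count/probability vectors, smoothed to avoid log(0).
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def kl_means(doc_word, n_clusters, max_iter=50, seed=0):
    """Alternate (a) recomputing cluster word distributions and (b) reassigning
    each document to the KL-closest cluster, stopping when no document moves --
    mirroring the 'until no document's cluster index needs to be adjusted'
    stopping rule of Algorithm 1 (single view, no word/link co-clustering)."""
    rng = np.random.default_rng(seed)
    n_docs = doc_word.shape[0]
    labels = rng.integers(0, n_clusters, size=n_docs)
    for _ in range(max_iter):
        # Recompute each cluster's aggregate word distribution; reseed empty clusters.
        centroids = np.vstack([
            doc_word[labels == k].sum(axis=0) if np.any(labels == k)
            else doc_word[rng.integers(0, n_docs)]
            for k in range(n_clusters)
        ])
        # Reassign every document to its KL-nearest cluster.
        new_labels = np.array([
            np.argmin([kl_divergence(row, c) for c in centroids])
            for row in doc_word
        ])
        if np.array_equal(new_labels, labels):  # no document moved: converged
            break
        labels = new_labels
    return labels

# Toy document-term matrix: docs 0-1 use words 0-1, docs 2-3 use words 2-3.
doc_word = np.array([[5, 5, 0, 0],
                     [5, 5, 0, 0],
                     [0, 0, 5, 5],
                     [0, 0, 5, 5]], dtype=float)
labels = kl_means(doc_word, n_clusters=2)
```

On this toy matrix the loop separates the two word-usage blocks into distinct clusters; the full IMAM algorithm additionally co-clusters words and links and carries class information from the source domain.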
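The Experiment Setup row reports that hyperparameters (word/link cluster counts tuned from 32 to 512, α set to 0.7, λ chosen from {0, 0.25, 0.5, 1, 2, 4}) were selected by five-fold cross-validation on training data. A generic grid-search-with-CV skeleton of that procedure might look like the following; the grids mirror the ranges quoted above, but the splitting scheme and the scoring stub are illustrative assumptions, not the authors' code.

```python
from itertools import product
from statistics import mean

# Hypothetical grids mirroring the ranges reported in the paper.
ALPHA_GRID = [0.5, 0.6, 0.7, 0.8, 0.9]     # trade-off weight alpha (illustrative)
LAMBDA_GRID = [0, 0.25, 0.5, 1, 2, 4]      # the lambda values the authors tried
CLUSTER_GRID = [32, 64, 128, 256, 512]     # word/link cluster counts, 32..512

def five_fold_splits(n_items, n_folds=5):
    """Yield (train_indices, valid_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    for k in range(n_folds):
        valid = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, valid

def tune(score_fn, n_items):
    """Return the (alpha, lambda, clusters) combination with the best mean CV score."""
    best, best_score = None, float("-inf")
    for alpha, lam, k in product(ALPHA_GRID, LAMBDA_GRID, CLUSTER_GRID):
        scores = [score_fn(alpha, lam, k, train, valid)
                  for train, valid in five_fold_splits(n_items)]
        if mean(scores) > best_score:
            best, best_score = (alpha, lam, k), mean(scores)
    return best

def toy_score(alpha, lam, k, train, valid):
    # Stand-in for training/evaluating the model on one fold; peaks at the
    # settings the paper reports (alpha=0.7, lambda=0.5, 128 clusters).
    return -(abs(alpha - 0.7) + abs(lam - 0.5) + abs(k - 128) / 512)

best = tune(toy_score, n_items=20)
```

With the toy scoring function, the search recovers (0.7, 0.5, 128), matching the settings the paper reports; in practice `score_fn` would train IMAM on the training folds and evaluate on the held-out fold.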