Hierarchical Optimal Transport for Document Representation

Authors: Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, Justin M. Solomon

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost. ... We test against existing metrics on k-NN classification and show that it outperforms others on average. ... (Section 5, Experiments) We present timings for metric computation and consider applications where distance between documents plays a crucial role: k-NN classification, low-dimensional visualization, and link prediction.
Researcher Affiliation | Collaboration | IBM Research, MIT CSAIL, MIT-IBM Watson AI Lab
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | Yes | Code: https://github.com/IBM/HOTT
Open Datasets | Yes | We consider 8 document classification datasets: BBC sports news articles (BBCSPORT) labeled by sport; tweets labeled by sentiments (TWITTER) (Sanders, 2011); Amazon reviews labeled by category (AMAZON); Reuters news articles labeled by topic (REUTERS) (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); sentences from scientific articles labeled by publisher (CLASSIC); newsgroup posts labeled by category (20NEWS), with by-date train-test split and removing headers, footers and quotes (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html); and Project Gutenberg full-length books from 142 authors (GUTENBERG), using the author names as classes and an 80/20 train-test split in the order of document appearance. For GUTENBERG, we reduced the vocabulary to the most common 15000 words. For 20NEWS, we removed words appearing in 5 documents.
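The vocabulary reduction the quote describes (keeping the most common 15000 words for GUTENBERG, dropping low-document-frequency words for 20NEWS) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code; the function name, tokenization, and the exact document-frequency cutoff semantics are assumptions.

```python
from collections import Counter

def build_vocabulary(documents, max_features=None, min_doc_freq=1):
    """Illustrative vocabulary pruning (not from the paper's repo):
    keep words appearing in at least `min_doc_freq` documents,
    ordered by total frequency, optionally capped at `max_features`."""
    doc_freq = Counter()   # in how many documents each word appears
    term_freq = Counter()  # total occurrences across the corpus
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    vocab = [w for w, _ in term_freq.most_common()
             if doc_freq[w] >= min_doc_freq]
    if max_features is not None:
        vocab = vocab[:max_features]  # e.g. 15000 for GUTENBERG
    return vocab
```

For example, `build_vocabulary(docs, max_features=15000)` would mimic the GUTENBERG setting, and a document-frequency threshold would mimic the 20NEWS filtering.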
Dataset Splits | Yes | REUTERS (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); ... and Project Gutenberg full-length books from 142 authors (GUTENBERG) using the author names as classes and 80/20 train-test split in the order of document appearance.
Hardware Specification | Yes | All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4GHz with 32GB of RAM.
Software Dependencies | Yes | All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4GHz with 32GB of RAM. ... Every instance of the OT linear program is solved using Gurobi (Gurobi Optimization, 2018).
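For context, the "OT linear program" handed to Gurobi is the standard Kantorovich formulation of optimal transport (written here from the general definition, not copied from the paper):

```latex
W(p, q) \;=\; \min_{\Gamma \ge 0} \sum_{i,j} \Gamma_{ij}\, C_{ij}
\quad \text{s.t.} \quad
\sum_j \Gamma_{ij} = p_i \;\;\forall i,
\qquad
\sum_i \Gamma_{ij} = q_j \;\;\forall j,
```

where \(C_{ij}\) is the ground cost between items \(i\) and \(j\) (e.g., a word-embedding distance), and \(p, q\) are the two distributions being compared. Any LP solver can handle this; Gurobi is simply the solver the authors chose.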
Experiment Setup | Yes | During training, we fit LDA with 70 topics using a Gibbs sampler (Griffiths & Steyvers, 2004). Topics are truncated to the 20 most heavily-weighted words and renormalized. ... When computing HOTT between a pair of documents we truncate topic proportions at 1/(|T| + 1) and renormalize.
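The two truncate-and-renormalize steps in that quote are simple to state in code. Below is a minimal sketch, assuming topic proportions arrive as a plain Python list and a topic as a word-to-probability dict; the helper names are illustrative, not taken from the paper's implementation.

```python
def truncate_and_renormalize(weights, threshold):
    """Zero out entries below `threshold`, then rescale so the rest
    sum to 1. Mirrors truncating topic proportions at 1/(|T| + 1)."""
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

def truncate_topic(word_probs, k=20):
    """Keep a topic's k most heavily weighted words and renormalize,
    as in the paper's truncation of topics to 20 words."""
    top = sorted(word_probs.items(), key=lambda kv: -kv[1])[:k]
    z = sum(p for _, p in top)
    return {w: p / z for w, p in top}
```

For instance, with four topics, `truncate_and_renormalize(props, 1.0 / (4 + 1))` drops any topic with proportion below 0.2 before renormalizing the remainder.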