Hierarchical Optimal Transport for Document Representation

Authors: Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, Justin M. Solomon

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost. ... We test against existing metrics on k-NN classification and show that it outperforms others on average. ... (Section 5, Experiments) We present timings for metric computation and consider applications where distance between documents plays a crucial role: k-NN classification, low-dimensional visualization, and link prediction.
Researcher Affiliation | Collaboration | IBM Research, MIT CSAIL, MIT-IBM Watson AI Lab
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | Yes | Code: https://github.com/IBM/HOTT
Open Datasets | Yes | We consider 8 document classification datasets: BBC sports news articles (BBCSPORT) labeled by sport; tweets labeled by sentiments (TWITTER) (Sanders, 2011); Amazon reviews labeled by category (AMAZON); Reuters news articles labeled by topic (REUTERS) (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); sentences from scientific articles labeled by publisher (CLASSIC); newsgroup posts labeled by category (20NEWS), with by-date train-test split and removing headers, footers and quotes (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html); and Project Gutenberg full-length books from 142 authors (GUTENBERG), using the author names as classes and an 80/20 train-test split in the order of document appearance. For GUTENBERG, we reduced the vocabulary to the most common 15000 words. For 20NEWS, we removed words appearing in 5 documents.
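The vocabulary reduction the quote describes (keeping the most common 15000 words for GUTENBERG, dropping low-document-frequency words for 20NEWS) can be sketched in plain Python. This is an illustrative reconstruction, not the authors' code; the function name, tokenization, and the exact document-frequency cutoff semantics are assumptions.

```python
from collections import Counter

def build_vocabulary(documents, max_features=None, min_doc_freq=1):
    """Illustrative vocabulary pruning (not from the paper's repo):
    keep words appearing in at least `min_doc_freq` documents,
    ordered by total frequency, optionally capped at `max_features`."""
    doc_freq = Counter()   # in how many documents each word appears
    term_freq = Counter()  # total occurrences across the corpus
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    vocab = [w for w, _ in term_freq.most_common()
             if doc_freq[w] >= min_doc_freq]
    if max_features is not None:
        vocab = vocab[:max_features]  # e.g. 15000 for GUTENBERG
    return vocab
```

For example, `build_vocabulary(docs, max_features=15000)` would mimic the GUTENBERG setting, and a document-frequency threshold would mimic the 20NEWS filtering.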
Dataset Splits | Yes | REUTERS (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); ... and Project Gutenberg full-length books from 142 authors (GUTENBERG) using the author names as classes and 80/20 train-test split in the order of document appearance.
Hardware Specification | Yes | All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4GHz with 32GB of RAM.
Software Dependencies | Yes | All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4GHz with 32GB of RAM. ... Every instance of the OT linear program is solved using Gurobi (Gurobi Optimization, 2018).
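For context, the "OT linear program" handed to Gurobi is the standard Kantorovich formulation of optimal transport (written here from the general definition, not copied from the paper):

```latex
W(p, q) \;=\; \min_{\Gamma \ge 0} \sum_{i,j} \Gamma_{ij}\, C_{ij}
\quad \text{s.t.} \quad
\sum_j \Gamma_{ij} = p_i \;\;\forall i,
\qquad
\sum_i \Gamma_{ij} = q_j \;\;\forall j,
```

where \(C_{ij}\) is the ground cost between items \(i\) and \(j\) (e.g., a word-embedding distance), and \(p, q\) are the two distributions being compared. Any LP solver can handle this; Gurobi is simply the solver the authors chose.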
Experiment Setup | Yes | During training, we fit LDA with 70 topics using a Gibbs sampler (Griffiths & Steyvers, 2004). Topics are truncated to the 20 most heavily-weighted words and renormalized. ... When computing HOTT between a pair of documents we truncate topic proportions at 1/(|T| + 1) and renormalize.
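The two truncate-and-renormalize steps in that quote are simple to state in code. Below is a minimal sketch, assuming topic proportions arrive as a plain Python list and a topic as a word-to-probability dict; the helper names are illustrative, not taken from the paper's implementation.

```python
def truncate_and_renormalize(weights, threshold):
    """Zero out entries below `threshold`, then rescale so the rest
    sum to 1. Mirrors truncating topic proportions at 1/(|T| + 1)."""
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept]

def truncate_topic(word_probs, k=20):
    """Keep a topic's k most heavily weighted words and renormalize,
    as in the paper's truncation of topics to 20 words."""
    top = sorted(word_probs.items(), key=lambda kv: -kv[1])[:k]
    z = sum(p for _, p in top)
    return {w: p / z for w, p in top}
```

For instance, with four topics, `truncate_and_renormalize(props, 1.0 / (4 + 1))` drops any topic with proportion below 0.2 before renormalizing the remainder.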