Hierarchical Optimal Transport for Document Representation
Authors: Mikhail Yurochkin, Sebastian Claici, Edward Chien, Farzaneh Mirzazadeh, Justin M. Solomon
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our technique for k-NN classification and show better interpretability and scalability with comparable performance to current methods at a fraction of the cost. ... We test against existing metrics on k-NN classification and show that it outperforms others on average. ... (Section 5, Experiments) We present timings for metric computation and consider applications where distance between documents plays a crucial role: k-NN classification, low-dimensional visualization, and link prediction. |
| Researcher Affiliation | Collaboration | IBM Research; MIT CSAIL; MIT-IBM Watson AI Lab |
| Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper. |
| Open Source Code | Yes | Code: https://github.com/IBM/HOTT |
| Open Datasets | Yes | We consider 8 document classification datasets: BBC sports news articles (BBCSPORT) labeled by sport; tweets labeled by sentiments (TWITTER) (Sanders, 2011); Amazon reviews labeled by category (AMAZON); Reuters news articles labeled by topic (REUTERS) (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); sentences from scientific articles labeled by publisher (CLASSIC); newsgroup posts labeled by category (20NEWS), with by-date train-test split and removing headers, footers and quotes (https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html); and Project Gutenberg full-length books from 142 authors (GUTENBERG) using the author names as classes and 80/20 train-test split in the order of document appearance. For GUTENBERG, we reduced the vocabulary to the most common 15000 words. For 20NEWS, we removed words appearing in 5 documents. |
| Dataset Splits | Yes | REUTERS (we use the 8-class version and train-test split of Cachopo et al. (2007)); medical abstracts labeled by cardiovascular disease types (OHSUMED) (using 10 classes and train-test split as in Kusner et al. (2015)); ... and Project Gutenberg full-length books from 142 authors (GUTENBERG) using the author names as classes and 80/20 train-test split in the order of document appearance. |
| Hardware Specification | Yes | All distance computations were implemented in Python 3.7 and run on an Intel i7-6700K at 4GHz with 32GB of RAM. |
| Software Dependencies | Yes | All distance computations were implemented in Python 3.7. ... Every instance of the OT linear program is solved using Gurobi (Gurobi Optimization, 2018). |
| Experiment Setup | Yes | During training, we fit LDA with 70 topics using a Gibbs sampler (Griffiths & Steyvers, 2004). Topics are truncated to the 20 most heavily-weighted words and renormalized. ... When computing HOTT between a pair of documents we truncate topic proportions at 1/(|T| + 1) and renormalize. |
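The two truncation steps quoted in the Experiment Setup row (keeping each topic's 20 most heavily-weighted words, and zeroing topic proportions below 1/(|T| + 1), each followed by renormalization) can be sketched as below. This is a minimal illustration under assumed array layouts, not the authors' released code; function names and the NumPy representation are assumptions.

```python
import numpy as np

def truncate_topics(topic_word, k=20):
    """Keep each topic's k most heavily-weighted words, then renormalize.

    topic_word: (num_topics, vocab_size) array of LDA topic-word probabilities.
    """
    truncated = np.zeros_like(topic_word)
    for t, row in enumerate(topic_word):
        top = np.argsort(row)[-k:]        # indices of the k largest weights
        truncated[t, top] = row[top]
    # Each row now has at most k nonzeros; rescale rows to sum to 1.
    return truncated / truncated.sum(axis=1, keepdims=True)

def truncate_proportions(doc_topics, num_topics):
    """Zero out topic proportions below 1/(|T| + 1), then renormalize.

    doc_topics: length-num_topics vector of a document's topic proportions.
    """
    thresh = 1.0 / (num_topics + 1)
    kept = np.where(doc_topics >= thresh, doc_topics, 0.0)
    # At least one proportion is >= 1/num_topics > thresh, so the sum is positive.
    return kept / kept.sum()
```

With 70 topics (as in the quoted setup), the proportion threshold is 1/71, so a document's near-zero topics are dropped before the document-level optimal transport problem is solved over the surviving topics.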