Near-Lossless Binarization of Word Embeddings
Authors: Julien Tissier, Christophe Gravier, Amaury Habrard
AAAI 2019, pp. 7104-7111 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of 2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors. (A hedged sketch of such a Hamming-distance top-k query follows the table.) |
| Researcher Affiliation | Academia | Julien Tissier, Christophe Gravier, Amaury Habrard — Univ. Lyon, UJM Saint-Etienne, CNRS, Lab Hubert Curien UMR 5516, 42023 Saint-Etienne, France. {julien.tissier, christophe.gravier, amaury.habrard}@univ-st-etienne.fr |
| Pseudocode | No | The paper describes the autoencoder architecture and mathematical formulations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | The entire source code to generate and evaluate binary vectors is available online: https://github.com/tca19/near-lossless-binarization |
| Open Datasets | Yes | Pre-trained embeddings: the autoencoder learns binary vectors from several pre-trained embeddings: dict2vec (Tissier, Gravier, and Habrard 2017), which contains 2.3M words and has been trained on the full English Wikipedia corpus; fasttext (Bojanowski et al. 2017), which contains 1M words and has also been trained on the English Wikipedia corpus; and GloVe (Pennington, Socher, and Manning 2014), which contains 400k words and has been trained on both the English Wikipedia and Gigaword 5 corpora. |
| Dataset Splits | Yes | Each dataset is split into a training and a test file and the same training and test files are used for all word embedding models. |
| Hardware Specification | No | The paper mentions general CPU optimizations and memory usage benefits but does not specify any particular CPU model, GPU, or other hardware used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software libraries or their version numbers (e.g., TensorFlow, PyTorch, scikit-learn, with versions) that are ancillary to the research. |
| Experiment Setup | Yes | The model uses a batch size of 75, 10 epochs for dict2vec and fasttext, 5 epochs for GloVe (the autoencoder converges faster due to the smaller vocabulary), and a learning rate of 0.001. The regularization hyperparameter λreg depends on the starting vectors and the binary vector size. It varies from 1 to 4 in the experiments... (A hedged training-loop sketch using these settings follows the table.) |
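As noted in the pseudocode row, the paper gives no algorithm block, so the sketch below is a hypothetical PyTorch reconstruction of a binarizing autoencoder trained with the quoted settings (batch size 75, learning rate 0.001, 5-10 epochs, λreg between 1 and 4). The straight-through gradient estimator, the tanh decoder that reuses the encoder weights, and the orthogonality-style regularizer are assumptions about the loss, not the authors' released C implementation.

```python
import torch
import torch.nn as nn


class BinarizingAutoencoder(nn.Module):
    """Hypothetical binarizing autoencoder; a sketch, not the authors' C code."""

    def __init__(self, dim=300, n_bits=256):
        super().__init__()
        # Single weight matrix shared by encoder and decoder (assumption).
        self.W = nn.Parameter(torch.empty(n_bits, dim))
        nn.init.xavier_uniform_(self.W)

    def forward(self, x):
        pre = x @ self.W.t()                 # (batch, n_bits) pre-activations
        hard = (pre > 0).float()             # Heaviside binarization
        b = hard + pre - pre.detach()        # straight-through gradient estimator
        recon = torch.tanh(b @ self.W)       # reconstruct the input vector
        return recon, hard


def train(model, vectors, lam_reg=1.0, epochs=10, batch_size=75, lr=1e-3):
    """Training loop using the hyperparameters quoted in the table above."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    eye = torch.eye(model.W.shape[0])
    for _ in range(epochs):
        perm = torch.randperm(vectors.shape[0])
        for i in range(0, vectors.shape[0], batch_size):
            x = vectors[perm[i:i + batch_size]]
            recon, _ = model(x)
            rec_loss = ((recon - x) ** 2).sum(dim=1).mean()
            # Orthogonality-style regularizer weighted by lambda_reg (assumption
            # about the exact form; the paper tunes lambda_reg between 1 and 4).
            reg = ((model.W @ model.W.t() - eye) ** 2).sum()
            loss = rec_loss + lam_reg * reg
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```

The straight-through estimator is a common workaround for the zero gradient of the Heaviside step: the forward pass uses the hard binary code while the backward pass treats the binarization as the identity.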
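The reported top-k speedup comes from replacing floating-point dot products with bitwise operations on packed binary codes. The minimal NumPy sketch below shows a Hamming-distance top-k query over packed codes; the uint8 packing, the `np.unpackbits` bit count (a slow but API-safe stand-in for the hardware popcount a C benchmark would use), and the random 256-bit codes in the usage example are illustrative assumptions, not the released implementation.

```python
import numpy as np


def pack_codes(bits01):
    """Pack a (n_words, n_bits) array of {0,1} bits into uint8 words."""
    return np.packbits(bits01.astype(np.uint8), axis=1)


def top_k_hamming(packed_vocab, packed_query, k=10):
    """Return the k nearest words to a query under Hamming distance.

    XOR the query against every packed code, then count the differing bits;
    a C implementation would use the popcnt instruction instead of unpackbits.
    """
    xored = np.bitwise_xor(packed_vocab, packed_query)   # (n_words, n_bits/8)
    dists = np.unpackbits(xored, axis=1).sum(axis=1)     # Hamming distances
    nearest = np.argpartition(dists, k)[:k]              # unordered top-k
    return nearest[np.argsort(dists[nearest])]           # sorted by distance


if __name__ == "__main__":
    # Usage sketch with random 256-bit codes (illustration only,
    # not the binary vectors released by the authors).
    rng = np.random.default_rng(0)
    vocab_bits = rng.integers(0, 2, size=(400_000, 256))
    packed = pack_codes(vocab_bits)
    query = pack_codes(vocab_bits[:1])
    print(top_k_hamming(packed, query, k=5))
```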