Near-Lossless Binarization of Word Embeddings

Authors: Julien Tissier, Christophe Gravier, Amaury Habrard (pp. 7104-7111)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of 2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
Researcher Affiliation | Academia | Julien Tissier, Christophe Gravier, Amaury Habrard; Univ. Lyon, UJM Saint-Etienne, CNRS, Lab Hubert Curien UMR 5516, 42023 Saint-Etienne, France; {julien.tissier, christophe.gravier, amaury.habrard}@univ-st-etienne.fr
Pseudocode | No | The paper describes the autoencoder architecture and mathematical formulations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The entire source code to generate and evaluate binary vectors is available online: https://github.com/tca19/near-lossless-binarization
Open Datasets | Yes | Pre-trained embeddings: Our autoencoder learns binary vectors from several pre-trained embeddings: dict2vec (Tissier, Gravier, and Habrard 2017), which contains 2.3M words and has been trained on the full English Wikipedia corpus; fasttext (Bojanowski et al. 2017), which contains 1M words and has also been trained on the English Wikipedia corpus; and GloVe (Pennington, Socher, and Manning 2014), which contains 400k words and has been trained on both the English Wikipedia and Gigaword 5 corpora.
Dataset Splits | Yes | Each dataset is split into a training and a test file, and the same training and test files are used for all word embedding models.
Hardware Specification | No | The paper mentions general CPU optimizations and memory usage benefits but does not specify any particular CPU model, GPU, or other hardware used for running the experiments.
Software Dependencies | No | The paper does not list the specific software libraries or version numbers (e.g., TensorFlow, PyTorch, scikit-learn) that would be needed to reproduce the experiments.
Experiment Setup | Yes | The model uses a batch size of 75, 10 epochs for dict2vec and fasttext, 5 epochs for GloVe (the autoencoder converges faster due to the smaller vocabulary), and a learning rate of 0.001. The regularization hyperparameter λreg depends on the starting vectors and the binary vector size; it varies from 1 to 4 in the experiments...
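
The Pseudocode and Experiment Setup rows describe an autoencoder that turns a real-valued embedding into a binary code and is trained with a reconstruction objective plus a regularizer weighted by λreg. The sketch below is a minimal PyTorch illustration of that kind of model, assuming a Heaviside (sign-threshold) encoder, a Wᵀ decoder, an orthogonality regularizer of the form ||W Wᵀ - I||², a straight-through gradient estimator, and an Adam optimizer; the class and function names are hypothetical, and this is not the code from the repository linked above.

```python
# Hedged sketch only: a plausible binarization autoencoder in the spirit of
# the paper (binary code from a thresholded linear encoder, W^T decoder,
# reconstruction loss plus an orthogonality regularizer weighted by lambda_reg).
# The straight-through estimator and Adam optimizer are assumptions.
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):          # hypothetical name
    def __init__(self, dim_in=300, n_bits=256):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(n_bits, dim_in))

    def forward(self, x):
        pre = x @ self.W.t()                 # (batch, n_bits)
        b_hard = (pre > 0).float()           # Heaviside step -> {0, 1} codes
        b = pre + (b_hard - pre).detach()    # straight-through gradient trick
        x_rec = b @ self.W                   # decoder: W^T b
        return b_hard, x_rec

def loss_fn(model, x, x_rec, lambda_reg=2.0):
    rec = ((x - x_rec) ** 2).sum(dim=1).mean()           # reconstruction term
    eye = torch.eye(model.W.shape[0])
    reg = ((model.W @ model.W.t() - eye) ** 2).sum()     # ||W W^T - I||^2
    return rec + 0.5 * lambda_reg * reg

# Toy training loop using the reported settings: batch size 75,
# learning rate 0.001, 5-10 epochs, lambda_reg between 1 and 4.
model = BinaryAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
batch = torch.randn(75, 300)                 # stand-in for real embeddings
for _ in range(10):
    _, x_rec = model(batch)
    loss = loss_fn(model, batch, x_rec, lambda_reg=2.0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The batch size, learning rate, epoch count, and λreg range in the toy loop follow the values quoted in the Experiment Setup row; everything else is an illustrative assumption.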
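The Research Type row quotes a 30x speed-up for top-k queries with binary vectors. A common explanation is that similarity over bit-packed codes reduces to XOR plus popcount instead of floating-point dot products. The NumPy sketch below illustrates that idea only; the function names are ours, and it is a simplified stand-in for the repository's optimized code.

```python
# Illustrative sketch of Hamming-distance top-k search over bit-packed codes;
# function names are ours, and this is not the repository's optimized C code.
import numpy as np

def pack_bits(binary_matrix):
    """Pack a {0,1} matrix of shape (n, n_bits) into bytes, 8 bits per uint8."""
    return np.packbits(binary_matrix.astype(np.uint8), axis=1)

def topk_hamming(query_packed, codes_packed, k=10):
    """Indices of the k codes closest to the query in Hamming distance."""
    xor = np.bitwise_xor(codes_packed, query_packed)   # differing bits
    dist = np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per row
    return np.argsort(dist)[:k]

# Toy usage: 256-bit codes for a 100k-word vocabulary.
rng = np.random.default_rng(0)
codes = pack_bits(rng.integers(0, 2, size=(100_000, 256)))
query = pack_bits(rng.integers(0, 2, size=(1, 256)))
print(topk_hamming(query, codes, k=5))
```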