Near-Lossless Binarization of Word Embeddings

Authors: Julien Tissier, Christophe Gravier, Amaury Habrard (pp. 7104-7111)

AAAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of 2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors.
Researcher Affiliation | Academia | Julien Tissier, Christophe Gravier, Amaury Habrard; Univ. Lyon, UJM Saint-Etienne, CNRS, Lab Hubert Curien UMR 5516, 42023 Saint-Etienne, France; {julien.tissier, christophe.gravier, amaury.habrard}@univ-st-etienne.fr
Pseudocode | No | The paper describes the autoencoder architecture and mathematical formulations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The entire source code to generate and evaluate binary vectors is available online: https://github.com/tca19/near-lossless-binarization
Open Datasets | Yes | Pre-trained embeddings: Our autoencoder learns binary vectors from several pre-trained embeddings: dict2vec (Tissier, Gravier, and Habrard 2017), which contains 2.3M words and has been trained on the full English Wikipedia corpus; fasttext (Bojanowski et al. 2017), which contains 1M words and has also been trained on the English Wikipedia corpus; and GloVe (Pennington, Socher, and Manning 2014), which contains 400k words and has been trained on both the English Wikipedia and Gigaword 5 corpora.
Dataset Splits | Yes | Each dataset is split into a training and a test file, and the same training and test files are used for all word embedding models.
Hardware Specification | No | The paper mentions general CPU optimizations and memory usage benefits but does not specify any particular CPU model, GPU, or other hardware used for running the experiments.
Software Dependencies | No | The paper does not list the specific software libraries or version numbers (e.g., TensorFlow, PyTorch, scikit-learn) that would be needed to reproduce the experiments.
Experiment Setup | Yes | The model uses a batch size of 75, 10 epochs for dict2vec and fasttext, 5 epochs for GloVe (the autoencoder converges faster due to the smaller vocabulary), and a learning rate of 0.001. The regularization hyperparameter λreg depends on the starting vectors and the binary vector size; it varies from 1 to 4 in the experiments...
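
The Pseudocode and Experiment Setup rows describe an autoencoder that turns a real-valued embedding into a binary code and is trained with a reconstruction objective plus a regularizer weighted by λreg. The sketch below is a minimal PyTorch illustration of that kind of model, assuming a Heaviside (sign-threshold) encoder, a Wᵀ decoder, an orthogonality regularizer of the form ||W Wᵀ - I||², a straight-through gradient estimator, and an Adam optimizer; the class and function names are hypothetical, and this is not the code from the repository linked above.

```python
# Hedged sketch only: a plausible binarization autoencoder in the spirit of
# the paper (binary code from a thresholded linear encoder, W^T decoder,
# reconstruction loss plus an orthogonality regularizer weighted by lambda_reg).
# The straight-through estimator and Adam optimizer are assumptions.
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):          # hypothetical name
    def __init__(self, dim_in=300, n_bits=256):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(n_bits, dim_in))

    def forward(self, x):
        pre = x @ self.W.t()                 # (batch, n_bits)
        b_hard = (pre > 0).float()           # Heaviside step -> {0, 1} codes
        b = pre + (b_hard - pre).detach()    # straight-through gradient trick
        x_rec = b @ self.W                   # decoder: W^T b
        return b_hard, x_rec

def loss_fn(model, x, x_rec, lambda_reg=2.0):
    rec = ((x - x_rec) ** 2).sum(dim=1).mean()           # reconstruction term
    eye = torch.eye(model.W.shape[0])
    reg = ((model.W @ model.W.t() - eye) ** 2).sum()     # ||W W^T - I||^2
    return rec + 0.5 * lambda_reg * reg

# Toy training loop using the reported settings: batch size 75,
# learning rate 0.001, 5-10 epochs, lambda_reg between 1 and 4.
model = BinaryAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=0.001)
batch = torch.randn(75, 300)                 # stand-in for real embeddings
for _ in range(10):
    _, x_rec = model(batch)
    loss = loss_fn(model, batch, x_rec, lambda_reg=2.0)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The batch size, learning rate, epoch count, and λreg range in the toy loop follow the values quoted in the Experiment Setup row; everything else is an illustrative assumption.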
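The Research Type row quotes a 30x speed-up for top-k queries with binary vectors. A common explanation is that similarity over bit-packed codes reduces to XOR plus popcount instead of floating-point dot products. The NumPy sketch below illustrates that idea only; the function names are ours, and it is a simplified stand-in for the repository's optimized code.

```python
# Illustrative sketch of Hamming-distance top-k search over bit-packed codes;
# function names are ours, and this is not the repository's optimized C code.
import numpy as np

def pack_bits(binary_matrix):
    """Pack a {0,1} matrix of shape (n, n_bits) into bytes, 8 bits per uint8."""
    return np.packbits(binary_matrix.astype(np.uint8), axis=1)

def topk_hamming(query_packed, codes_packed, k=10):
    """Indices of the k codes closest to the query in Hamming distance."""
    xor = np.bitwise_xor(codes_packed, query_packed)   # differing bits
    dist = np.unpackbits(xor, axis=1).sum(axis=1)      # popcount per row
    return np.argsort(dist)[:k]

# Toy usage: 256-bit codes for a 100k-word vocabulary.
rng = np.random.default_rng(0)
codes = pack_bits(rng.integers(0, 2, size=(100_000, 256)))
query = pack_bits(rng.integers(0, 2, size=(1, 256)))
print(topk_hamming(query, codes, k=5))
```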