Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Near-Lossless Binarization of Word Embeddings
Authors: Julien Tissier, Christophe Gravier, Amaury Habrard (pp. 7104-7111)
AAAI 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on semantic similarity, text classification and sentiment analysis tasks show that the binarization of word embeddings only leads to a loss of 2% in accuracy while vector size is reduced by 97%. Furthermore, a top-k benchmark demonstrates that using these binary vectors is 30 times faster than using real-valued vectors. |
| Researcher Affiliation | Academia | Julien Tissier, Christophe Gravier, Amaury Habrard Univ. Lyon, UJM Saint-Etienne CNRS, Lab Hubert Curien UMR 5516 42023, Saint-Etienne, France EMAIL |
| Pseudocode | No | The paper describes the autoencoder architecture and mathematical formulations but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Entire source code to generate and evaluate binary vectors is available online. https://github.com/tca19/near-lossless-binarization |
| Open Datasets | Yes | Pre-trained embeddings: Our autoencoder learns binary vectors from several pre-trained embeddings: dict2vec (Tissier, Gravier, and Habrard 2017) which contains 2.3M words and has been trained on the full English Wikipedia corpus; fasttext (Bojanowski et al. 2017) which contains 1M words and has also been trained on the English Wikipedia corpus; and GloVe (Pennington, Socher, and Manning 2014) which contains 400k words and has been trained on both English Wikipedia and Gigaword 5 corpora. |
| Dataset Splits | Yes | Each dataset is split into a training and a test file and the same training and test files are used for all word embedding models. |
| Hardware Specification | No | The paper mentions general CPU optimizations and memory usage benefits but does not specify any particular CPU model, GPU, or other hardware used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software libraries or their version numbers (e.g., TensorFlow, PyTorch, scikit-learn, with versions) that are ancillary to the research. |
| Experiment Setup | Yes | The model uses a batch size of 75, 10 epochs for dict2vec and fasttext, and 5 epochs for GloVe (the autoencoder converges faster due to the smaller vocabulary) and a learning rate of 0.001. The regularization hyperparameter λreg depends on the starting vectors and the binary vector size. It varies from 1 to 4 in the experiments... |
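
The hyperparameters quoted in the Experiment Setup row, combined with the vocabulary sizes from the Open Datasets row, can be collected into a small configuration sketch. This is only an illustration and is not taken from the authors' repository: the `TrainingSetup` class and its method names are hypothetical, and the update counts assume that each word vector is one training example and that an epoch is a single pass over the vocabulary, neither of which is stated in the table.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hypothetical container for the values quoted in the table above."""
    embedding: str
    vocab_size: int               # approximate size quoted in the Open Datasets row
    epochs: int
    batch_size: int = 75          # quoted in the Experiment Setup row
    learning_rate: float = 0.001  # quoted in the Experiment Setup row

    def batches_per_epoch(self) -> int:
        # Assumption: one training example per word, one full pass per epoch.
        return math.ceil(self.vocab_size / self.batch_size)

    def total_updates(self) -> int:
        return self.epochs * self.batches_per_epoch()


SETUPS = [
    TrainingSetup("dict2vec", 2_300_000, epochs=10),
    TrainingSetup("fasttext", 1_000_000, epochs=10),
    TrainingSetup("GloVe", 400_000, epochs=5),
]

# The table only gives a range for the regularization weight; the exact value
# per embedding and per binary vector size is not listed here.
LAMBDA_REG_RANGE = (1, 4)

if __name__ == "__main__":
    for setup in SETUPS:
        print(setup.embedding, setup.batches_per_epoch(), setup.total_updates())
```

Under these assumptions GloVe needs far fewer parameter updates than dict2vec, both because its vocabulary is smaller and because it is trained for half as many epochs.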