Localized Centering: Reducing Hubness in Large-Sample Data

Authors: Kazuo Hara, Ikumi Suzuki, Masashi Shimbo, Kei Kobayashi, Kenji Fukumizu, Miloš Radovanović

AAAI 2015

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our experimental results using synthetic data indicate that localized centering reduces the hubness not suppressed by classical centering. Using large real-world datasets, moreover, we show that the proposed method improves the performance of document classification with kNN insofar as it reduces hubness. |
| Researcher Affiliation | Academia | Kazuo Hara (kazuo.hara@gmail.com), National Institute of Genetics, Mishima, Shizuoka, Japan; Ikumi Suzuki (suzuki.ikumi@gmail.com), National Institute of Genetics, Mishima, Shizuoka, Japan; Masashi Shimbo (shimbo@is.naist.jp), Nara Institute of Science and Technology, Ikoma, Nara, Japan; Kei Kobayashi (kei@ism.ac.jp), The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan; Kenji Fukumizu (fukumizu@ism.ac.jp), The Institute of Statistical Mathematics, Tachikawa, Tokyo, Japan; Miloš Radovanović (radacha@dmi.uns.ac.rs), University of Novi Sad, Novi Sad, Serbia |
| Pseudocode | No | The paper describes mathematical formulations and processes but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using a MATLAB script (norm_mp_empiric.m, distributed at http://ofai.at/~dominik.schnitzer/mp) for a baseline method (Mutual Proximity), but it neither provides source code for the proposed method (Localized Centering) nor states that such code is available. |
| Open Datasets | Yes | The datasets are: WebKB, Reuters-52, and 20Newsgroups, all preprocessed and distributed by Cardoso-Cachopo (2007), and TDT2-30 distributed by Cai, He, and Han (2005). ... Cai, D.; He, X.; and Han, J. 2005. Document clustering using locality preserving indexing. IEEE Transactions on Knowledge and Data Engineering 17(12):1624-1637. Datasets available at http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html. Cardoso-Cachopo, A. 2007. Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa. Datasets available at http://web.ist.utl.pt/acardoso/datasets/. |
| Dataset Splits | Yes | To simulate a situation in which the number of training samples is large, we ignored the predefined training-test splits provided with the datasets. Instead, the performance was evaluated by the accuracy of the leave-one-out cross validation over all samples. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using a "MATLAB script" for one of the methods (MP), but it does not specify any software versions (e.g., MATLAB version, or versions of any libraries or frameworks used). |
| Experiment Setup | Yes | We represented each document as a tf-idf weighted bag-of-words vector normalized to unit length. Throughout the experiment, inner product is used as the measure of similarity. ... The parameter κ in Local Affinity can be different from the parameter k of the kNN classification performed subsequently. Indeed, in later experiments, we will tune κ so as to maximize the correlation with the N_10 skewness, independently from the kNN classification. ... Parameter γ can be tuned so as to maximally reduce the skewness of the N_k distribution. |
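
The setup quoted in the table can be illustrated with a minimal NumPy sketch; it is not the authors' code, which the table notes is not available. It assumes documents are already given as non-negative tf-idf row vectors, that N_k counts how often a sample appears in the k-nearest-neighbor lists of the other samples, that local affinity is the mean inner product between a sample and its κ nearest neighbors, and that localized centering subtracts that affinity raised to the power γ from the inner-product similarity. All function names are hypothetical.

```python
import numpy as np

def unit_normalize(X):
    """Scale each row to unit Euclidean length (all-zero rows stay zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1.0, norms)

def knn_indices(S, k):
    """For each row of similarity matrix S, indices of the k most similar other samples."""
    S = S.astype(float)           # astype returns a copy, so the caller's S is untouched
    np.fill_diagonal(S, -np.inf)  # a sample is never its own neighbor
    return np.argsort(-S, axis=1)[:, :k]

def nk_skewness(S, k=10):
    """Skewness of the N_k distribution; large positive values indicate hubs."""
    n = S.shape[0]
    counts = np.bincount(knn_indices(S, k).ravel(), minlength=n).astype(float)
    centered = counts - counts.mean()
    std = centered.std()
    return 0.0 if std == 0 else float(np.mean(centered ** 3) / std ** 3)

def local_affinity(X, kappa):
    """Assumed definition: mean inner product of each sample with its kappa nearest neighbors."""
    S = X @ X.T
    nn = knn_indices(S, kappa)
    return np.take_along_axis(S, nn, axis=1).mean(axis=1)

def localized_centering(X, kappa, gamma=1.0):
    """Assumed form: entry (i, j) = <x_i, x_j> - local_affinity(x_j) ** gamma.

    Rows are queries and columns are candidate neighbors, so the result is
    generally not symmetric. Affinities must be non-negative (true for tf-idf
    vectors) for fractional gamma to be well defined.
    """
    S = X @ X.T
    return S - local_affinity(X, kappa)[np.newaxis, :] ** gamma

def loo_knn_accuracy(S, labels, k):
    """Leave-one-out k-NN accuracy by majority vote over a precomputed similarity matrix."""
    labels = np.asarray(labels)
    correct = 0
    for i, neighbors in enumerate(knn_indices(S, k)):
        values, votes = np.unique(labels[neighbors], return_counts=True)
        correct += int(values[np.argmax(votes)] == labels[i])
    return correct / len(labels)
```

Under these assumptions, κ and γ could be chosen by a simple grid search that keeps the pair giving the smallest `nk_skewness(localized_centering(X, kappa, gamma), k=10)`, after which `loo_knn_accuracy` is evaluated on the adjusted similarity matrix.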