Entropy Rate Estimation for Markov Chains with Large State Space

Authors: Yanjun Han, Jiantao Jiao, Chuan-Zheng Lee, Tsachy Weissman, Yihong Wu, Tiancheng Yu

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure. We compare the empirical performance of various estimators for entropy rate on a variety of synthetic data sets, and demonstrate the superior performance of the information-theoretically optimal estimators compared to the empirical entropy rate. We apply the information-theoretically optimal estimators to estimate the entropy rate of the Penn Treebank (PTB) and the Google One Billion Words (1BW) datasets.
Researcher Affiliation | Academia | Yanjun Han, Department of Electrical Engineering, Stanford University, Stanford, CA 94305 (yjhan@stanford.edu); Jiantao Jiao, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 (jiantao@berkeley.edu); Chuan-Zheng Lee and Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA 94305 ({czlee, tsachy}@stanford.edu); Yihong Wu, Department of Statistics and Data Science, Yale University, New Haven, CT 06511 (yihong.wu@yale.edu); Tiancheng Yu, Department of Electronic Engineering, Tsinghua University, Haidian, Beijing 100084 (thueeyutc14@foxmail.com)
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not provide any statement about, or link to, open-source code for the described methodology.
Open Datasets | Yes | We used two well-known linguistic corpora: the Penn Treebank (PTB) and Google's One Billion Words (1BW) benchmark.
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits, nor does it refer to predefined splits with citations.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory, or other machine specifications) used to run its experiments.
Software Dependencies | No | The paper does not provide software dependency details, such as library or solver names with version numbers.
Experiment Setup | No | The paper discusses the approach and the use of k-grams, but it does not provide specific experimental setup details such as concrete hyperparameter values or detailed training configurations.
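
For context on the evaluation described above, below is a minimal sketch of the empirical (plug-in) k-gram conditional entropy estimate, which is the baseline the paper compares against when estimating the entropy rate of the PTB and 1BW corpora. It is not the information-theoretically optimal estimators the authors propose, and the tokenization, function names, and toy data here are illustrative assumptions.

```python
from collections import Counter
from math import log2

def empirical_conditional_entropy(tokens, k):
    """Plug-in estimate of H(X_k | X_1, ..., X_{k-1}) in bits per token,
    computed from empirical k-gram and (k-1)-gram frequencies.
    For k = 1 this reduces to the marginal entropy of the tokens."""
    n = len(tokens) - k + 1
    kgrams = Counter(tuple(tokens[i:i + k]) for i in range(n))
    # H(X_1, ..., X_k) under the empirical k-gram distribution
    h_k = -sum((c / n) * log2(c / n) for c in kgrams.values())
    if k == 1:
        return h_k
    m = len(tokens) - (k - 1) + 1
    prefixes = Counter(tuple(tokens[i:i + k - 1]) for i in range(m))
    h_km1 = -sum((c / m) * log2(c / m) for c in prefixes.values())
    # Conditional entropy = joint entropy of k-grams minus entropy of (k-1)-grams
    return h_k - h_km1

if __name__ == "__main__":
    # Toy word stream; the paper's experiments use the PTB and 1BW corpora.
    text = "the cat sat on the mat the cat sat on the rug".split()
    for k in (1, 2, 3):
        print(k, round(empirical_conditional_entropy(text, k), 3))
```

Increasing k makes this conditional entropy approach the entropy rate in principle, but the plug-in estimate requires very large samples once the k-gram state space grows, which is the regime the paper's sample-optimal estimators are designed for.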