Entropy Rate Estimation for Markov Chains with Large State Space
Authors: Yanjun Han, Jiantao Jiao, Chuan-Zheng Lee, Tsachy Weissman, Yihong Wu, Tiancheng Yu
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure. We compare the empirical performance of various estimators for entropy rate on a variety of synthetic data sets, and demonstrate the superior performances of the information-theoretically optimal estimators compared to the empirical entropy rate. We apply the information-theoretically optimal estimators to estimate the entropy rate of the Penn Treebank (PTB) and the Google One Billion Words (1BW) datasets. |
| Researcher Affiliation | Academia | Yanjun Han, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, yjhan@stanford.edu; Jiantao Jiao, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, jiantao@berkeley.edu; Chuan-Zheng Lee and Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, {czlee, tsachy}@stanford.edu; Yihong Wu, Department of Statistics and Data Science, Yale University, New Haven, CT 06511, yihong.wu@yale.edu; Tiancheng Yu, Department of Electronic Engineering, Tsinghua University, Haidian, Beijing 100084, thueeyutc14@foxmail.com |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper does not provide any statement or link regarding open-source code for the described methodology. |
| Open Datasets | Yes | We used two well-known linguistic corpora: the Penn Treebank (PTB) and Google's One Billion Words (1BW) benchmark. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test dataset splits, nor does it refer to predefined splits with citations. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library or solver names with version numbers. |
| Experiment Setup | No | The paper discusses the approach and the use of k-grams, but it does not provide specific experimental setup details such as concrete hyperparameter values or detailed training configurations. |
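
The "empirical entropy rate" that the paper uses as a baseline is the plug-in estimate computed from observed transition frequencies of a first-order Markov chain. The sketch below is not from the paper; it is a minimal illustration of that baseline, assuming a character-level first-order chain, base-2 logarithms (so the estimate is in bits per symbol, which relates to perplexity via perplexity = 2^entropy), and a made-up toy sequence.

```python
from collections import Counter, defaultdict
import math

def empirical_entropy_rate(seq):
    """Plug-in (empirical) entropy rate estimate, in bits per symbol,
    treating the sequence as a sample path of a first-order Markov chain."""
    if len(seq) < 2:
        return 0.0
    # Count transitions x_t -> x_{t+1}.
    trans = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        trans[a][b] += 1
    n_trans = len(seq) - 1
    h = 0.0
    for a, counts in trans.items():
        total_a = sum(counts.values())
        pi_a = total_a / n_trans          # empirical weight of state a
        for c in counts.values():
            p = c / total_a               # empirical transition probability T(b|a)
            h -= pi_a * p * math.log2(p)  # pi(a) * T(b|a) * log2 1/T(b|a)
    return h

# Hypothetical toy usage: character-level chain on a short string.
print(round(empirical_entropy_rate("abababababcabababab"), 3))
```

The information-theoretically optimal estimators discussed in the paper instead apply bias-corrected entropy estimators to each conditional (row) distribution rather than the raw plug-in; that machinery is beyond this illustration.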