Entropy Rate Estimation for Markov Chains with Large State Space

Authors: Yanjun Han, Jiantao Jiao, Chuan-Zheng Lee, Tsachy Weissman, Yihong Wu, Tiancheng Yu

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure. We compare the empirical performance of various estimators for entropy rate on a variety of synthetic data sets, and demonstrate the superior performance of the information-theoretically optimal estimators compared to the empirical entropy rate. We apply the information-theoretically optimal estimators to estimate the entropy rate of the Penn Treebank (PTB) and the Google One Billion Words (1BW) datasets.
Researcher Affiliation | Academia | Yanjun Han, Department of Electrical Engineering, Stanford University, Stanford, CA 94305 (yjhan@stanford.edu); Jiantao Jiao, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720 (jiantao@berkeley.edu); Chuan-Zheng Lee and Tsachy Weissman, Department of Electrical Engineering, Stanford University, Stanford, CA 94305 ({czlee, tsachy}@stanford.edu); Yihong Wu, Department of Statistics and Data Science, Yale University, New Haven, CT 06511 (yihong.wu@yale.edu); Tiancheng Yu, Department of Electronic Engineering, Tsinghua University, Haidian, Beijing 100084 (thueeyutc14@foxmail.com)
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | The paper does not provide any statement about, or link to, open-source code for the described methodology.
Open Datasets | Yes | We used two well-known linguistic corpora: the Penn Treebank (PTB) and Google's One Billion Words (1BW) benchmark.
Dataset Splits | No | The paper does not explicitly provide specific percentages or counts for training, validation, and test splits, nor does it refer to predefined splits with citations.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory, or other machine specifications) used to run its experiments.
Software Dependencies | No | The paper does not provide software dependency details, such as library or solver names with version numbers.
Experiment Setup | No | The paper discusses the approach and the use of k-grams, but it does not provide specific experimental setup details such as concrete hyperparameter values or detailed training configurations.
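
For context on the evaluation described above, below is a minimal sketch of the empirical (plug-in) k-gram conditional entropy estimate, which is the baseline the paper compares against when estimating the entropy rate of the PTB and 1BW corpora. It is not the information-theoretically optimal estimators the authors propose, and the tokenization, function names, and toy data here are illustrative assumptions.

```python
from collections import Counter
from math import log2

def empirical_conditional_entropy(tokens, k):
    """Plug-in estimate of H(X_k | X_1, ..., X_{k-1}) in bits per token,
    computed from empirical k-gram and (k-1)-gram frequencies.
    For k = 1 this reduces to the marginal entropy of the tokens."""
    n = len(tokens) - k + 1
    kgrams = Counter(tuple(tokens[i:i + k]) for i in range(n))
    # H(X_1, ..., X_k) under the empirical k-gram distribution
    h_k = -sum((c / n) * log2(c / n) for c in kgrams.values())
    if k == 1:
        return h_k
    m = len(tokens) - (k - 1) + 1
    prefixes = Counter(tuple(tokens[i:i + k - 1]) for i in range(m))
    h_km1 = -sum((c / m) * log2(c / m) for c in prefixes.values())
    # Conditional entropy = joint entropy of k-grams minus entropy of (k-1)-grams
    return h_k - h_km1

if __name__ == "__main__":
    # Toy word stream; the paper's experiments use the PTB and 1BW corpora.
    text = "the cat sat on the mat the cat sat on the rug".split()
    for k in (1, 2, 3):
        print(k, round(empirical_conditional_entropy(text, k), 3))
```

Increasing k makes this conditional entropy approach the entropy rate in principle, but the plug-in estimate requires very large samples once the k-gram state space grows, which is the regime the paper's sample-optimal estimators are designed for.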