Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Approximating mutual information of high-dimensional variables using learned representations

Authors: Gokul Gowri, Xiaokang Lun, Allon Klein, Peng Yin

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Using several benchmarks, we show that unlike existing techniques, LMI can approximate MI well for variables with > 10^3 dimensions if their dependence structure is captured by low-dimensional representations. |
| Researcher Affiliation | Academia | (1) Wyss Institute for Biologically Inspired Engineering; (2) Department of Systems Biology, Harvard University |
| Pseudocode | Yes | Algorithm 1: Estimating MI using LMI Approximation; Algorithm 2: k-nearest neighbor log density ratio estimator; Algorithm 3: Early Stopping; Algorithm 4: Generating multivariate Gaussian datasets with low-dimensional dependence structure; Algorithm 5: KSG estimator for pointwise estimates |
| Open Source Code | Yes | Code availability: The code necessary to reproduce all results from this paper are available at https://github.com/ggdna/latent-mutual-information. The lmi Python package can be found at https://github.com/ggdna/latentmi, and its documentation is hosted at https://latentmi.readthedocs.io. |
| Open Datasets | Yes | We resample two different source datasets: (1) binary subset of MNIST, containing only images of 0s and 1s, with 5000 samples and 784 dimensions and (2) embeddings of a subset of protein sequences from E. coli and A. thaliana proteins, with 4402 samples and 1024 dimensions. ... We study a previously published LT-seq data set of in vitro differentiating mouse hematopoietic stem cells [42]. |
| Dataset Splits | Yes | That is, for N joint samples, we train the network using a subset of N/2 samples, then estimate MI by applying the estimator of [10] to latent representations of the remaining N/2 samples. ... They are trained with batch size of 512, with 1:1 train-validation splits, and a maximum of 300 epochs using early stopping procedure provided in Algorithm 3. |
| Hardware Specification | Yes | All experiments in this paper were done using a single NVIDIA RTX 3090. |
| Software Dependencies | No | The paper mentions that 'All models are implemented in Pytorch [52]' but does not provide specific version numbers for Pytorch or any other software dependencies like Python. |
| Experiment Setup | Yes | For a variable with dimensionality d, the encoder has hidden layer sizes L, L/2, L/4 with L = max(2^(log2(d)), 1024). Decoders have the same structure inverted, with hidden layer sizes L/4, L/2, L. All MLP activations used are Leaky ReLUs, with negative slope 0.2, except the last layers of decoders, which have no activation. Cross-decoders are trained with 50% dropout after each activation layer. All weights are initialized using Xavier uniform initialization [51], and optimized using Adam, with hyperparameters listed in Table 1. They are trained with batch size of 512, with 1:1 train-validation splits, and a maximum of 300 epochs using early stopping procedure provided in Algorithm 3. |
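
The layer-width rule and the half-and-half sample split quoted above can be illustrated with a short, standard-library-only Python sketch. It is not taken from the paper's code or the lmi package: the function names `hidden_sizes` and `split_half` are hypothetical, and the sketch reads the width rule as L = max(2^ceil(log2(d)), 1024), i.e. d rounded up to a power of two with a floor of 1024. That rounding is an assumption, since the expression as printed in the Experiment Setup row is ambiguous.

```python
import random


def hidden_sizes(d, floor=1024):
    """Return (encoder_widths, decoder_widths) for an input of dimension d.

    Assumption: L is d rounded up to the next power of two, but never
    smaller than `floor`. The decoder mirrors the encoder.
    """
    # (d - 1).bit_length() is ceil(log2(d)) for d >= 1, using integers only.
    next_pow2 = 1 << (d - 1).bit_length()
    L = max(next_pow2, floor)
    encoder = [L, L // 2, L // 4]
    decoder = encoder[::-1]  # L/4, L/2, L
    return encoder, decoder


def split_half(n_samples, seed=0):
    """Shuffle sample indices and split them into two disjoint halves.

    The first half would be used to fit the networks, the second half
    to estimate MI on latent representations.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    half = n_samples // 2
    return idx[:half], idx[half:]


if __name__ == "__main__":
    enc, dec = hidden_sizes(784)   # e.g. flattened MNIST images
    print(enc, dec)
    fit_idx, est_idx = split_half(5000)
    print(len(fit_idx), len(est_idx))
```

Under this reading, both datasets named in the table (784 and 1024 dimensions) get the same widths, because the 1024 floor applies. How the 1:1 train-validation split relates to the fitting half is not spelled out in the quoted text, so the sketch leaves it out.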