Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PhyloVAE: Unsupervised Learning of Phylogenetic Trees via Variational Autoencoders

Authors: Tianyu Xie, David Harry Tyensoung Richman, Jiansi Gao, Frederick A Matsen, Cheng Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments demonstrate PhyloVAE's robust representation learning capabilities and fast generation of phylogenetic tree topologies.
Researcher Affiliation Academia Tianyu Xie (1), Harry Richman (3), Jiansi Gao (3), Frederick A. Matsen IV (3,4), Cheng Zhang (1,2); (1) School of Mathematical Sciences, Peking University; (2) Center for Statistical Science, Peking University; (3) Computational Biology Program, Fred Hutchinson Cancer Research Center; (4) Howard Hughes Medical Institute. EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1: A linear-time algorithm for tree topology encoding
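For intuition only, a generic linear-time tree-topology encoding can be built from a single post-order traversal, since each node is visited a constant number of times. The sketch below is a hypothetical illustration of that complexity argument, not a reproduction of the paper's Algorithm 1 (whose specific encoding is defined in the PDF):

```python
# Illustrative only: a generic O(#nodes) post-order encoding of a rooted
# tree topology. NOT the paper's Algorithm 1; it merely shows why a
# traversal-based encoding runs in linear time.

def postorder_encoding(children, root):
    """Encode a rooted tree as the post-order sequence of its node labels.

    `children` maps each node to a list of its children; leaves map to [].
    Every node is pushed and popped exactly once, so the cost is linear.
    """
    encoding, stack = [], [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            encoding.append(node)  # emit a node after all of its children
        else:
            stack.append((node, True))
            for child in reversed(children[node]):
                stack.append((child, False))
    return encoding

# A four-leaf topology ((A,B),(C,D)) with internal nodes X, Y and root R:
tree = {"R": ["X", "Y"], "X": ["A", "B"], "Y": ["C", "D"],
        "A": [], "B": [], "C": [], "D": []}
print(postorder_encoding(tree, "R"))  # ['A', 'B', 'X', 'C', 'D', 'Y', 'R']
```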
Open Source Code Yes Our code is released at https://github.com/tyuxie/PhyloVAE.
Open Datasets Yes Following Hillis et al. (2005), we select five genes and the ground truth phylogenetic tree (Figure 9; 44 leaves) from the early placental mammal evolution analysis in Murphy et al. (2001). The sequence alignment under consideration comprises 290 rabies genomes (Viana et al., 2023). Finally, we assess the generative modeling performance of PhyloVAE on eight benchmark sequence sets, DS1-8, which contain biological sequences from 27 to 64 eukaryote species and are commonly considered for benchmarking tree topology density estimation and Bayesian phylogenetic inference tasks in previous works (Zhang & Matsen IV, 2018; 2019; 2024; Zhang, 2020; Mimori & Hamada, 2023; Zhou et al., 2023; Xie & Zhang, 2023; Xie et al., 2024a;b; Molén et al., 2024; Hotti et al., 2024).
Dataset Splits No For each gene, we simulate the DNA sequences with a fixed length along the ground truth tree using the corresponding evolutionary model, run a MrBayes chain (Ronquist et al., 2012) for one million iterations, and sample every 100 iterations in the last 100,000 iterations to gather the posterior samples, as done in Hillis et al. (2005). These one million iterations are enough for the MrBayes run to converge. These 5,000 tree topologies with uniform weights constitute the training set of PhyloVAE. ... (i) for each sequence set, there are 10 replicate training sets of tree topologies which are gathered from 10 independent MrBayes runs until the runs have ASDSF (the standard convergence criterion used in MrBayes) less than 0.01 or a maximum of 100 million iterations (tree topologies are sampled every 100 iterations with the first 25% of iterations discarded); (ii) for each sequence set, the ground truth of tree topologies is gathered from 10 single-chain MrBayes runs of one billion iterations (tree topologies are sampled every 1,000 iterations with the first 25% of iterations discarded).
Hardware Specification Yes The experiments are run on a single 2.4 GHz CPU. ... The experiments are run on a single NVIDIA RTX 2080Ti GPU.
Software Dependencies No For all experiments, PhyloVAE is implemented in PyTorch (Paszke et al., 2019). The optimizer is Adam (Kingma & Ba, 2015) with parameters (β1, β2) = (0.9, 0.999) and weight_decay = 0.0.
Experiment Setup Yes The optimizer is Adam (Kingma & Ba, 2015) with parameters (β1, β2) = (0.9, 0.999) and weight_decay = 0.0. The results are collected after 200,000 iterations with batch size B = 10. ... The dimension of the latent space is set to d = 2. The generative model is a three-layer MLP with 512 hidden units and a ResNet architecture. For the inference model, the number of message passing rounds is L = 2, and both MLPµ and MLPσ are composed of a two-layer MLP with 100 hidden units. The number of particles in the multi-sample lower bound (3) is K = 32. The learning rate is set to 0.0003 at the beginning and anneals according to a cosine schedule.