Most genes are composed of multiple domains, units that share a common evolutionary history and typically perform a specific function in the resulting protein. As many studies of key gene families have shown, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Like evolutionary events that affect entire genes, these domain events have major consequences for phylogenetic reconstruction. In addition, they create considerable obstacles for gene sequence alignment algorithms, and alignment is a prerequisite for phylogenetic reconstruction.
We introduce the Domain-DLRS model, a hierarchical, generative probabilistic model with three levels corresponding to species, genes, and domains. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model, and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its edge lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution.
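As a rough illustration of two ingredients of this generative process, the following Python sketch simulates a birth-death (duplication-loss) process along a single dated edge and then relaxes the clock on that edge. It is a minimal sketch under stated assumptions, not the Domain-DLRS implementation: the function and parameter names are hypothetical, the process is restricted to one edge rather than a full host tree, and the gamma-distributed rate with mean 1 is an assumed choice of rate model.

```python
import random


def simulate_lineage_count(t, birth, death, rng):
    """Count lineages alive after time t on one edge, starting from one lineage.

    Each lineage duplicates at rate `birth` and is lost at rate `death`
    (a linear birth-death process, simulated event by event).
    """
    n, now = 1, 0.0
    per_lineage_rate = birth + death
    if per_lineage_rate == 0:
        return n  # no events can occur
    while n > 0:
        # Waiting time to the next event among n independent lineages.
        now += rng.expovariate(n * per_lineage_rate)
        if now >= t:
            break
        # The event is a duplication with probability birth / (birth + death).
        if rng.random() < birth / per_lineage_rate:
            n += 1
        else:
            n -= 1
    return n


def relax_clock(edge_time, shape, rng):
    """Convert an edge time to an edge length.

    Assumption: the substitution rate is gamma distributed with mean 1
    (shape `shape`, scale 1 / shape).
    """
    rate = rng.gammavariate(shape, 1.0 / shape)
    return edge_time * rate


if __name__ == "__main__":
    rng = random.Random(1)
    survivors = simulate_lineage_count(t=1.0, birth=0.8, death=0.5, rng=rng)
    length = relax_clock(edge_time=1.0, shape=4.0, rng=rng)
    print(survivors, length)
```

In the full model this edge-level process would be applied twice, once to generate the gene tree inside the species tree and once to generate each domain tree inside the gene tree, with sequences then evolved along the resulting edge lengths.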
For this model, we present an MCMC-based inference framework, also called Domain-DLRS, that takes as input a dated species tree together with a multiple sequence alignment for each domain family, and produces as output an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than aligned genes, our framework avoids the problem of aligning genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that Domain-DLRS outperforms MrBayes on both synthetic and biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, some of which have involved two or more domains, but that non-tandem duplications have also been common, in particular in gene families with a complex evolutionary history, such as PRDM9.
Probabilistic Modeling, Domain Evolution, Bayesian Inference, Domain Tree Reconstruction