Quantitative synteny scoring improves homology inference and partitioning of gene families
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-6664-1607
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. KTH, Centres, Science for Life Laboratory, SciLifeLab.
Stockholms universitet. ORCID iD: 0000-0001-5341-1733
2013 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 14, S12- p. Article in journal (Refereed) Published
Abstract [en]

Background: Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based either on direct similarity between gene pairs or on some form of network structure in which the edge weights of the constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology, and it has not been utilized to its fullest potential. Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy as well as the definition of synteny scores is the most valuable contribution of GenFamClust.
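
To illustrate how a synteny signal can complement pairwise similarity, the following Python sketch scores a gene pair by the best similarity found among their chromosomal neighbours. It is a toy under stated assumptions, not the GenFamClust definition: the function names, the window size k, and the max-over-neighbours rule are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Tuple

Gene = str
SimTable = Dict[Tuple[Gene, Gene], float]


def neighbours(order: List[Gene], gene: Gene, k: int) -> List[Gene]:
    """Return up to k genes on each side of `gene` in a linear gene order."""
    i = order.index(gene)
    lo = max(0, i - k)
    return order[lo:i] + order[i + 1:i + 1 + k]


def similarity(sim: SimTable, x: Gene, y: Gene) -> float:
    """Symmetric lookup; pairs without a recorded hit score 0.0."""
    return sim.get((x, y), sim.get((y, x), 0.0))


def synteny_score(a: Gene, b: Gene, order_a: List[Gene], order_b: List[Gene],
                  sim: SimTable, k: int = 2) -> float:
    """Hypothetical synteny score: the best similarity between any neighbour
    of `a` and any neighbour of `b` (an assumption, not the paper's formula)."""
    best = 0.0
    for x in neighbours(order_a, a, k):
        for y in neighbours(order_b, b, k):
            best = max(best, similarity(sim, x, y))
    return best


# Toy data: two genomes with three genes each and a few similarity hits.
genome1 = ["a1", "a2", "a3"]
genome2 = ["b1", "b2", "b3"]
sim = {("a1", "b1"): 0.9, ("a3", "b3"): 0.8, ("a2", "b2"): 0.3}

# a2 and b2 are only weakly similar, but their neighbours match well.
print(similarity(sim, "a2", "b2"))
print(synteny_score("a2", "b2", genome1, genome2, sim, k=1))
```

In this toy case the pair (a2, b2) has a low direct similarity but a high neighbourhood score, which is the kind of evidence the abstract argues similarity-only methods leave unused.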

Place, publisher, year, edition, pages
BioMed Central, 2013. Vol. 14, S12- p.
Keyword [en]
Efficient Algorithm, Eukaryotic Genomes, Protein Families, Orthologs, Identification, Clusters, Alignment, Blast, Link
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:kth:diva-136429
DOI: 10.1186/1471-2105-14-S15-S12
ISI: 000328316700012
OAI: oai:DiVA.org:kth-136429
DiVA: diva2:676098
Conference
11th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics, Lyon, France, Oct 17-19, 2013
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20131219

Available from: 2013-12-05 Created: 2013-12-05 Last updated: 2017-12-06 Bibliographically approved
In thesis
1. From genomes to post-processing of Bayesian inference of phylogeny
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Life is extremely complex and amazingly diverse; it has taken billions of years of evolution to attain the level of complexity we now observe in nature, which ranges from single-celled prokaryotes to multicellular human beings. With the availability of molecular sequence data, algorithms for inferring homology and gene families have emerged, and similarity in gene content between two genes has been the major signal used for homology inference. Recently there has been a significant rise in the number of species with fully sequenced genomes, which provides an opportunity to investigate and infer homologs with greater accuracy and in a more informed way. Phylogenetic analysis explains the relationships between the member genes of a gene family in a simple, graphical and plausible way using a tree representation. Bayesian phylogenetic inference is a probabilistic method used to infer gene phylogenies and posteriors of other evolutionary parameters. The Markov chain Monte Carlo (MCMC) algorithm, in particular with the Metropolis-Hastings sampling scheme, is the most commonly employed algorithm for determining the evolutionary history of genes. Many software packages process the results of each MCMC run and explore the parameter posterior, but there is a need for interactive software that can analyse both discrete and real-valued parameters and that has convergence assessment and burn-in estimation diagnostics designed specifically for Bayesian phylogenetic inference.
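
For readers unfamiliar with the Metropolis-Hastings scheme mentioned above, a minimal Python sketch follows. It samples a single real-valued parameter from a standard normal target with a symmetric random-walk proposal and then discards a fixed burn-in. The target, the step size, the fixed burn-in length and all function names are assumptions chosen for illustration; real phylogenetic samplers also move through tree space, and this is not the method of any software named in the abstract.

```python
import math
import random
from typing import Callable, List


def metropolis_hastings(log_target: Callable[[float], float], x0: float,
                        n_steps: int, step: float = 1.0,
                        seed: int = 0) -> List[float]:
    """Random-walk Metropolis-Hastings for one real-valued parameter."""
    rng = random.Random(seed)
    x = x0
    lp = log_target(x)
    trace = []
    for _ in range(n_steps):
        # Symmetric Gaussian proposal, so the Hastings ratio reduces
        # to the ratio of target densities.
        proposal = x + rng.gauss(0.0, step)
        lp_proposal = log_target(proposal)
        accept_prob = math.exp(min(0.0, lp_proposal - lp))
        if rng.random() < accept_prob:
            x, lp = proposal, lp_proposal
        trace.append(x)  # a rejected move repeats the current state
    return trace


def log_standard_normal(x: float) -> float:
    """Log density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x


# Start far from the mode so the early samples are unrepresentative.
trace = metropolis_hastings(log_standard_normal, x0=5.0, n_steps=20000)
burn_in = 2000  # fixed by hand here; an assumption, not an estimator
kept = trace[burn_in:]
mean = sum(kept) / len(kept)
variance = sum((v - mean) ** 2 for v in kept) / len(kept)
print(f"posterior mean ~ {mean:.2f}, variance ~ {variance:.2f}")
```

The hand-picked burn-in is the weak point of this sketch; choosing that cut-off from the trace itself is the kind of diagnostic the abstract says is needed.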

In this thesis, a synteny-aware approach to gene homology inference, called GenFamClust (GFC), is proposed that uses gene content and gene order conservation to infer homology. The feature that distinguishes GFC from earlier homology inference methods is that local synteny is combined with gene similarity to infer homologs, without inferring homologous regions. GFC was validated for accuracy on a simulated dataset. Gene families were computed by applying clustering algorithms to homologs inferred by GFC, and compared for accuracy, dependence and similarity with gene families inferred by other popular gene family inference methods on a eukaryotic dataset. Gene families in fungi obtained from GFC were evaluated against pillars from the Yeast Gene Order Browser. Genome-wide gene families for some eukaryotic species were computed using this approach.

Another topic addressed in this thesis is the processing of MCMC traces for Bayesian phylogenetic inference. We introduce new software, VMCMC, which simplifies post-processing of MCMC traces. VMCMC can be used both as a GUI-based application and as a convenient command-line tool. VMCMC supports interactive exploration, is suitable for automated pipelines and can handle both real-valued and discrete parameters observed in an MCMC trace. We propose and implement joint burn-in estimators that are specifically applicable to Bayesian phylogenetic inference. These methods have been compared for similarity with some other popular convergence diagnostics. We show that Bayesian phylogenetic inference and VMCMC can be applied to infer valuable evolutionary information for a biological case: the evolutionary history of the FERM domain.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. viii, 65 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:01
Keyword
Bayesian inference
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-181319 (URN)
978-91-7595-849-1 (ISBN)
Public defence
2016-02-25, Fire, Tomtebodavägen 23, 171 65, Solna, 14:00 (English)
Note

QC 20160201

Available from: 2016-02-01 Created: 2016-01-31 Last updated: 2016-02-01 Bibliographically approved
2. Probabilistic Modelling of Domain and Gene Evolution
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the years into complex hierarchical models that capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, so that they also model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al., 2009, that models domain evolution as occurring inside the gene and species trees. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov chain Monte Carlo technique, called Grouped Independence Metropolis-Hastings, to estimate the posterior distribution over domain and gene trees. Using this method, we analysed the Zinc-Finger and PRDM9 gene families, which provides interesting insight into domain evolution.

Finally, a synteny-aware approach to gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on a synthetic dataset and two biological datasets consisting of eukaryotic and fungal species. Our results show that combining synteny with similarity provides a significant improvement in homology inference.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 69 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 19
Keyword
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191352 (URN)
978-91-7729-091-9 (ISBN)
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, Stockholm, 09:00 (English)
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2016-09-04 Bibliographically approved
3. Computational Problems in Modeling Evolution and Inferring Gene Families.
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Over the last few decades, phylogenetics has emerged as a very promising field, providing a comparative framework to explain the genetic relationships among all living organisms on earth. These genetic relationships are typically represented by a bifurcating phylogenetic tree, the tree of life. Reconstructing a phylogenetic tree is one of the central tasks in evolutionary biology. Evolutionary processes such as gene duplications, gene losses, speciation, and lateral gene transfer events make phylogeny reconstruction more difficult. However, rapid developments in sequencing technologies and the availability of genome-scale sequencing data give us the opportunity to understand these evolutionary processes in a more informed manner and, ultimately, enable us to reconstruct gene and species phylogenies more accurately. This thesis is an attempt to provide computational methods for phylogenetic inference and tools for genome-scale comparative evolutionary studies, such as detecting homologous sequences and inferring gene families.

In the first project, we present FastPhylo, a software package containing fast tools for reconstructing distance-based phylogenies. It implements previously published efficient algorithms for estimating a distance matrix from the input sequences and reconstructing an unrooted Neighbour Joining tree from a given distance matrix. Results on simulated datasets reveal that FastPhylo can handle hundreds of thousands of sequences in a time- and memory-efficient manner. Its ease of use, well-defined interfaces, and modular structure allow FastPhylo to be used in very large bioinformatics pipelines.
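
As background on the Neighbour Joining step mentioned above, the following Python sketch computes the standard Q-criterion, Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and picks the pair of taxa to join first. It is a plain textbook illustration of a single iteration on a small made-up distance matrix; it is not FastPhylo's implementation, and the function name and example data are assumptions.

```python
from typing import List, Tuple


def first_join(d: List[List[float]]) -> Tuple[Tuple[int, int], float]:
    """Return the pair (i, j) minimising the Neighbour Joining Q-criterion
    for a symmetric distance matrix `d`, together with its Q value."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_pair, best_q = (0, 1), float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:  # strict '<' keeps the first pair on ties
                best_pair, best_q = (i, j), q
    return best_pair, best_q


# Five taxa (indices 0..4) with an illustrative additive distance matrix.
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

pair, q_value = first_join(distances)
print(pair, q_value)
```

Note that the selected pair (taxa 0 and 1) is not the pair with the smallest raw distance (taxa 3 and 4): the Q-criterion corrects each distance for how far both taxa are from everything else. A full Neighbour Joining run repeats this step, replacing the joined pair with a new internal node until the tree is resolved.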

In the second project, we present a synteny-aware gene homology method, called GenFamClust (GFC), that uses gene content and gene order conservation to detect homology. Results on simulated and biological datasets suggest that local synteny information combined with sequence similarity improves the detection of homologs.

In the third project, we introduce a novel phylogeny-based clustering method, PhyloGenClust, which partitions a very large gene family into smaller subfamilies. ROC (receiver operating characteristic) analysis on synthetic datasets shows that PhyloGenClust identifies subfamilies more accurately. PhyloGenClust can be used as a middle-tier clustering method between raw clustering methods, such as sequence similarity methods, and more sophisticated Bayesian phylogeny methods.
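
For readers new to ROC analysis, the Python sketch below builds an ROC curve from scored predictions with known labels and integrates it with the trapezoid rule. The scores and labels are invented toy values, the function names are hypothetical, and the code assumes that all scores are distinct and that both classes are present; it does not reproduce the evaluation in the thesis.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold one prediction at a time.
    Assumes distinct scores and at least one positive and one negative."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy predictions: 1 marks a true subfamily member, 0 a non-member.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]

curve = roc_points(scores, labels)
print(curve)
print(auc(curve))
```

An area of 1.0 would mean every true member is ranked above every non-member, while 0.5 corresponds to random ranking.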

Finally, we introduce a novel probabilistic Bayesian method, based on the DLTRS model, to sample reconciliations of a gene tree inside a species tree. The method uses an MCMC framework to integrate LGTs, gene duplications, gene losses and sequence evolution under a relaxed molecular clock for substitution rates. The proposed sampling method estimates the posterior distribution of gene trees and provides temporal information on LGT events over the lineages of a species tree. Analysis of simulated datasets reveals that our method performs well in identifying the true temporal estimates of LGT events. We applied our method to the genome-wide gene families of mollicutes and cyanobacteria, which gave interesting insight into potential LGT highways.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. 57 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2016:24
Keyword
Evolution, Phylogenetics, Lateral Gene Transfer, Gene Families, Clustering
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-193637 (URN)
978-91-7729-131-2 (ISBN)
Public defence
2016-10-18, Air, SciLifeLab, Tomtebodavägen 23A, Solna, 14:00 (English)
Note

QC 20161010

Available from: 2016-10-10 Created: 2016-10-06 Last updated: 2016-10-10 Bibliographically approved

Open Access in DiVA

fulltext (1035 kB), 76 downloads
File information
File name: FULLTEXT01.pdf
File size: 1035 kB
Checksum (SHA-512): 764f36655e6d6655d9d8548138389e60a6aec634f295f5d537dccb7c0db694efb3efbea20570d2ee0e6b5075857bb5df5bbe693344f51a66de7c8318e56a5731
Type: fulltext
Mimetype: application/pdf

Authority records BETA

Muhammad, Sayyed Auwn; Arvestad, Lars

Search in DiVA

By author/editor
Ali, Raja Hashim; Muhammad, Sayyed Auwn; Khan, Mehmodd Alam; Arvestad, Lars
By organisation
Computational Biology, CB; Science for Life Laboratory, SciLifeLab
In the same journal
BMC Bioinformatics
Bioinformatics (Computational Biology)
