Here we describe the SweGen data set, a comprehensive map of genetic variation in the Swedish population. These data represent a basic resource for clinical genetics laboratories as well as for sequencing-based association studies by providing information on genetic variant frequencies in a cohort that is well matched to national patient cohorts. To select samples for this study, we first examined the genetic structure of the Swedish population using high-density SNP-array data from a nation-wide cohort of over 10 000 Swedish-born individuals included in the Swedish Twin Registry. A total of 1000 individuals, reflecting a cross-section of the population and capturing the main genetic structure, were selected for whole-genome sequencing. Analysis pipelines were developed for automated alignment, variant calling and quality control of the sequencing data. This resulted in a genome-wide collection of aggregated variant frequencies in the Swedish population that we have made available to the scientific community through the website https://swefreq.nbis.se. A total of 29.2 million single-nucleotide variants and 3.8 million indels were detected in the 1000 samples, with 9.9 million of these variants not present in current databases. Each sample contributed an average of 7199 individual-specific variants. In addition, an average of 8645 larger structural variants (SVs) were detected per individual, and we demonstrate that the population frequencies of these SVs can be used for efficient filtering analyses. Finally, our results show that the genetic diversity within Sweden is substantial compared with the diversity among continental European populations, underscoring the relevance of establishing a local reference data set.
In genetics, with increasing data sizes and more advanced algorithms for mining complex data, a point is reached where increased computational capacity or alternative solutions become unavoidable. Most contemporary methods for linkage analysis are based on the Lander-Green hidden Markov model (HMM), which scales exponentially with the number of pedigree members. In whole-genome linkage analysis, genotype simulations become prohibitively time-consuming to perform on single computers. We have developed 'Grid-Allegro', a Grid-aware implementation of the Allegro software, by which several thousand genotype simulations can be performed in parallel in a short time. With temporary installations of the Allegro executable and datasets on remote nodes at submission, the need for predefined Grid run-time environments is circumvented. We evaluated the performance, efficiency and scalability of this implementation in a genome scan on Swedish multiplex Alzheimer's disease families. We demonstrate that 'Grid-Allegro' allows for the full exploitation of the features available in Allegro for genome-wide linkage. The implementation of existing bioinformatics applications on Grids (Distributed Computing) represents a cost-effective alternative for addressing highly resource-demanding and data-intensive bioinformatics tasks, compared to acquiring and setting up clusters of computational hardware in house (Parallel Computing), a resource not available to most geneticists today.
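The reason genotype simulations parallelise well is that each replicate is independent of the others. The following is a minimal sketch of that idea only, not the Grid-Allegro implementation: `simulate_null_statistic` is a hypothetical stand-in for one simulation replicate (it draws random numbers rather than calling Allegro), and a local thread pool stands in for remote Grid nodes.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_null_statistic(seed, n_markers=50):
    """Hypothetical stand-in for one genotype simulation replicate.

    Returns the maximum of n_markers standard-normal draws, mimicking a
    genome-wide maximum statistic under the null. A real replicate would
    simulate genotypes and rerun the linkage analysis instead.
    """
    rng = random.Random(seed)  # one seed per replicate keeps runs reproducible
    return max(rng.gauss(0.0, 1.0) for _ in range(n_markers))


def empirical_p_value(observed, n_replicates=1000, workers=4):
    """Estimate how often a null replicate reaches the observed statistic."""
    # Replicates do not depend on each other, so they can be dispatched
    # to separate workers (here: threads; on a Grid: remote nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        null_stats = list(pool.map(simulate_null_statistic, range(n_replicates)))
    exceed = sum(1 for s in null_stats if s >= observed)
    # Add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (n_replicates + 1)


if __name__ == "__main__":
    print(empirical_p_value(observed=3.0, n_replicates=2000))
```

Because every replicate carries its own seed, the result does not depend on how the work is split across workers, which is the property that makes distribution to many nodes straightforward.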
Compartmentalization of biological reactions is an important mechanism to allow multiple cellular reactions to occur in parallel. Resolving the spatial distribution of the human proteome at a subcellular level increases our understanding of human biology and disease. We have generated a high-resolution map of the subcellular distribution of the human proteome as part of the open access Human Protein Atlas database. We have shown that as many as half of all proteins localize to multiple compartments. Such proteins may have context-specific functions and ‘moonlight’ in different parts of the cell, thus increasing the functionality of the proteome and the complexity of the cell from a systems perspective. I will present how this spatial data can complement quantitative omics data for improved functional read-out. Furthermore, I will present unpublished data on the extent of single-cell variations of the human proteome, in correlation to cell cycle progression and other deterministic factors, as well as the overlap with observed variations at the RNA level. In summary, I will demonstrate the importance of spatial proteomics data for improved single-cell biology.
Rare genetic diseases are caused by different types of genetic variants, from single nucleotide variants (SNVs) to large chromosomal rearrangements. Recent data indicate that whole genome sequencing (WGS) may be used as a comprehensive test to identify multiple types of pathologic genetic aberrations in a single analysis.
We present FindSV, a bioinformatic pipeline for detection of balanced (inversions and translocations) and unbalanced (deletions and duplications) structural variants (SVs). First, FindSV was tested on 106 validated deletions and duplications with a median size of 850 kb (min: 511 bp, max: 155 Mb). All variants were detected. Second, we demonstrated the clinical utility in 138 monogenic WGS panels. SV analysis yielded 11 diagnostic findings (8%). Remarkably, a complex structural rearrangement involving two clustered deletions disrupting SCN1A, SCN2A, and SCN3A was identified in a three-month-old girl with epileptic encephalopathy. Finally, 100 consecutive samples referred for clinical microarray were also analyzed by WGS. The WGS data was screened genome-wide for large (>2 kbp) SVs, processed for visualization in our clinical routine arrayCGH workflow with the newly developed tool vcf2cytosure, and screened for exonic SVs and SNVs in a panel of 700 genes linked to intellectual disability. We also applied short tandem repeat (STR) expansion detection and discovered one pathologic expansion in ATXN7. The diagnostic rate (29%) was more than double that of clinical microarray (12%).
In conclusion, using WGS we have detected a wide range of structural variation with high accuracy, confirming that it is a powerful comprehensive genetic test in a clinical diagnostic laboratory setting.
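As an illustration of the size screen and the yield figures reported in this abstract, the sketch below filters SV calls by the >2 kbp threshold and recomputes the 8% diagnostic yield from 11 findings in 138 panels. It is not FindSV code: the `SVCall` record, its fields and both helper functions are hypothetical, and size is defined only for intrachromosomal events (translocations would need separate handling).

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    """Hypothetical minimal record for one intrachromosomal SV call."""
    chrom: str
    start: int   # start coordinate in bp
    end: int     # end coordinate in bp
    sv_type: str  # e.g. "DEL", "DUP", "INV"

    @property
    def size(self) -> int:
        return self.end - self.start


def large_svs(calls, min_size_bp=2000):
    """Keep calls strictly larger than the threshold (>2 kbp by default)."""
    return [c for c in calls if c.size > min_size_bp]


def diagnostic_yield(findings, cases):
    """Diagnostic yield as a percentage of analysed cases."""
    return 100.0 * findings / cases


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    calls = [
        SVCall("chr2", 1_000_000, 1_000_511, "DEL"),  # 511 bp, below threshold
        SVCall("chr2", 5_000_000, 5_850_000, "DUP"),  # 850 kb, kept
    ]
    print([c.sv_type for c in large_svs(calls)])
    print(round(diagnostic_yield(11, 138)))  # 11 of 138 panels, about 8%
```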
Alzheimer's disease (AD) is a neurodegenerative disease that affects approximately 20 million people worldwide. There are both sporadic and familial forms of AD. We have previously reported a genome-wide linkage analysis on 71 Swedish AD families using 365 genotyped microsatellite markers. In this study, we increased the number of individuals included in the original 71 analysed families and added 38 new families. These 109 families were genotyped for 1100 novel microsatellite markers. The present study reports on the linkage data generated from the non-overlapping genotypes from the first genome scan and the genotypes of the present scan, which results in a total of 1289 successfully genotyped markers at an average density of 2.85 cM on 468 individuals from 109 AD families. Non-parametric linkage analysis yielded a significant multipoint LOD score on chromosome 19q13, the region harbouring the major susceptibility gene APOE, both for the whole set of families (LOD = 5.0) and for the APOE epsilon 4-positive subgroup made up of 63 families (LOD = 5.3). Other suggestive linkage peaks that were observed in the original genome scan of 71 Swedish AD families were not detected in this extended analysis, and the previously reported linkage signals on chromosomes 9, 10 and 12 were not replicated.
In an attempt to map chromosomal regions carrying rare gene variants contributing to the risk of multiple sclerosis (MS), we identified segments shared identical-by-descent (IBD) using the refined IBD analysis of the software BEAGLE 4.0. IBD mapping aims at identifying segments inherited from a common ancestor and shared more frequently in case-case pairs. A total of 2106 MS patients of Nordic origin and 624 matched controls were genotyped on the Illumina Human Quad 660 chip, and an additional 1352 ethnically matched controls typed on Illumina HumanHap 550 and Illumina 1M were added. Quality control left a total of 441 731 markers for the analysis. After identification of segments shared by descent and significance testing, a filter function for markers with low IBD sharing was applied. Four regions on chromosomes 5, 9, 14 and 19 were found to be significantly associated with the risk for MS. However, all markers but one were located telomerically, including the very distal markers. For methodological reasons, such segments have a low sharing of IBD signals and are prone to be false positives. One marker on chromosome 19 reached genome-wide significance and was not one of the distal markers. This marker was located within the GNA11 gene, which has no previously reported association with MS. We conclude that IBD mapping is not sufficiently powered to identify MS risk loci even in ethnically relatively homogeneous populations, or, alternatively, that rare variants are not adequately present.