2. Cancer and non-coding regions
2.1. Functions attributed to non-coding regions
2.2. Examples of variants in non-coding regions known to be associated with cancer
2.3. Workflow to identify and characterize somatic variants
3. Technical issues for cancer variant identification in the non-coding regions
3.1. Sequencing and data pre-processing issues: problems and solutions
3.2. Somatic variant calling: problems and solutions
3.3. Benchmarking of called variants
4. Variant prioritization
4.1. Variant prioritization algorithms for known protein-coding regions
4.2. Variant prioritization algorithms for non-coding regions
4.3. The relative roles of exons and non-coding regions in tumorigenesis
5. Conclusion and perspectives for personalized medicine
With the cost of Whole-Genome Sequencing decreasing, non-coding regions, which until recently were dismissed as non-functional, are now being investigated to identify mutations that may drive tumorigenesis. While many variants in genes are known to be associated with cancer, few non-coding driver mutations have been identified, disruption of the TERT promoter being the only highly validated one. Distinguishing somatic variants in cancer cells from those in matched normal cells is particularly challenging because tumors are highly heterogeneous and polyploid; deep and accurate sequencing as well as optimal alignment are therefore necessary, especially for non-coding and repetitive regions. These steps help improve the precision and sensitivity of the variant calling algorithms that exist for the different categories of variants, and which have been shown to perform poorly in non-coding regions. Prioritization algorithms are useful for singling out highly pathogenic mutations among tens of thousands of somatic variants. They use features such as functional annotation, sequence conservation and recurrence to rank the deleteriousness of variants; however, they seem to perform better on genes than on non-coding regions, perhaps because of the small, cumulative effects of variants in these regions. With the increasing availability of genome-wide data and the improvement of sequencing, alignment, variant calling and prioritization methods, more non-coding driver mutations are expected to be found, which could help with personalized cancer diagnosis and therapy.
Cancer is a complex disease, in which the accumulation of germline and somatic mutations results in dysregulation of cell growth and causes tumorigenesis. Our understanding of the underlying causes of this disease has become clearer as next-generation sequencing (NGS) costs have fallen, throughput has increased, and computational and storage power have grown. Indeed, this has allowed researchers and clinicians to sequence thousands of matched normal-tumor tissues in cancer patients, and thus to identify millions of variants potentially involved in cancer development.
Medical research has been strongly biased towards protein-coding regions, mostly because of the prohibitive cost of Whole Genome Sequencing (WGS). However, as WGS is starting to be performed more widely, it appears that variants predominantly map to non-coding regions, and it is increasingly recognized that these variants likely play a large role in driving tumor growth. The main challenge, and the current major bottleneck in variant analysis, is the interpretation of the 3 to 5 million variants found in each subject. Successful identification of non-coding driver mutations could have wide implications for cancer diagnosis and treatment.
This review will assess the challenges and best practices for different steps within bioinformatics pipelines for the detection of highly deleterious somatic variants in non-coding regions.
Non-coding DNA sequences are the segments of DNA that do not encode protein sequences, and represent over 98% of the human genome sequence. Although they were previously referred to as “junk DNA”, they were rehabilitated in 2012 by the Encyclopaedia of DNA Elements (ENCODE) project, which constituted the first systematic attempt to delineate all functional elements in the human genome (ENCODE Project Consortium, 2012). Indeed, using biochemical assays and sequence conservation, and identifying disease states correlated with mutations in non-coding regions, the authors reported that around 80% of the human genome is functional. Functional units are defined by the fact that in at least one cell type, they encode a product, such as a protein or non-coding RNA, or display a biochemical signature (for example, protein binding, or a specific chromatin structure). While these results have been widely accepted, some scientists remain critical of that figure, claiming that “ENCODE’s definition of functional element identifies many sites that would not be considered functional or phenotype-determining by standard uses in biology” (Doolittle 2013). ENCODE researchers also found that many non-coding variants in individual genome sequences lie in ENCODE-annotated functional regions, and estimated that this number is at least as large as for variants in protein-coding genes. Finally, they found that single nucleotide polymorphisms (SNPs) associated with disease by Genome-Wide Association Studies (GWAS) are enriched within or near non-coding functional elements. These discoveries emphasized the huge potential of investigating variants in these regions for medical research.
There are several types of non-coding sequences which have been assigned functions. Firstly, non-coding sequences called regulatory elements control gene expression. These include cis-regulatory elements, sequences located in 5’ or 3’ untranslated regions (UTRs) or within introns, which may act as promoters, enhancers or silencers of gene expression when bound by trans-regulatory elements such as Transcription Factors (TFs). Transcribed sequences that are not translated into peptide products are also found to be involved in gene expression regulation: these include introns and several classes of non-coding RNAs (ncRNAs), such as microRNA (miRNA), transfer RNA (tRNA) and long non-coding RNA (lncRNA). A recent systematic study of unannotated transcripts in combination with proteomic data in mouse neurons confirmed the presence of non-canonical translation regions, and their observed temporal regulation suggests an important functional role (Sudhakaran et al. 2014). MiRNAs in particular have been extensively characterized, and microarray analyses have shown that a single miRNA is able to reduce the expression levels of many genes through partial complementarity to one or more messenger RNA molecules, helping to define tissue-specific gene expression in humans (Lim et al. 2005).
A large fraction of the genome consists of highly repetitive DNA. Some of these regions, such as the centromeres and telomeres, play critical roles in chromosomal maintenance – for example, telomeric regions at the ends of chromosomes provide protection from chromosomal deterioration during DNA replication. Despite this pertinent example, there is currently no evidence that the majority of highly repetitive elements are functional; intergenic non-coding sequences without clear functions could be involved in the spatial organization of genetic loci in interphase nuclei (Patrushev & Kovalenko, 2014).
As in coding sequences, non-coding regions can be disrupted by single nucleotide variants (SNVs), insertions and deletions of less than 50 base pairs (indels), and larger structural variants (SVs). These variants can be found at the germline level, which means that they are inherited and found in all cells of an individual, or they can appear in somatic cells at any point in the lifetime of an individual. According to Knudson’s “two-hit” hypothesis, formulated after noting that inherited retinoblastoma occurs at a younger age than the sporadic form of the disease, the interplay between germline and somatic variants can play an important role in cancer – if a germline non-coding mutation prevents the expression of a gene on one chromosome, a somatic mutation occurring in the same region on the homologous gene could lead to a total loss of function (Knudson 1971). However, this paper focuses on somatic mutations and their identification, as germline variant calling methods have been better validated, whereas calling in tumor samples poses additional challenges due to mixed cellularity and low-frequency sub-clonal variations.
The consequences of these mutations can be diverse depending on the location of the sequence disruption. A few examples of somatic non-coding mutations that have been found to be associated with cancer have been listed by Khurana et al. (2016). The first is the gain of a TF-binding site, exemplified by the promoter of telomerase reverse transcriptase (TERT), which encodes the catalytic subunit of the enzyme telomerase. Telomerase lengthens the ends of chromosomes and allows cells to escape apoptosis; it has thus often been found to be overexpressed in cancerous cells. Many recent studies have reported recurrent mutations in the TERT promoter in different cancer types – these mutations create binding motifs for the ETS family of TFs, and their binding subsequently upregulates the expression of TERT. Secondly, structural rearrangements can lead to the fusion of active regulatory elements with oncogenes, as happens frequently for the 5ʹ UTR of TMPRSS2 and ETS family genes in prostate cancer. This results in an overexpression of ERG, which then inhibits Androgen Receptor signalling and subsequently leads to malignant transformation in the prostate. Modifications in ncRNAs or their binding sites can also be a signature of cancer – the lncRNA MALAT1 seems to regulate the expression of genes associated with metastasis, and has been observed to be upregulated in lung cancer. Finally, transcribed pseudogenes are thought to affect and regulate their parental counterparts, as has been documented with microRNA sequestration for the BRAF gene (involved in breast cancer).
As depicted in figure 1, tumor and matched normal tissue are used to identify cancer-associated somatic variants. The first step is base-calling with an NGS technology, followed by short-read alignment against a human reference genome. Somatic variants can then be identified by comparing tumor-aligned reads to the normal tissue sequence. Driver mutations are generally identified using computational methods that aim to prioritize and interpret the functional effects of variants aggregated across samples. Finally, experimental approaches can be used for functional validation, and variants can then be translated to the clinic for diagnostic and therapeutic purposes.
The increased realization of the importance of non-coding variants in cancer led to the ongoing collaboration between the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), called the Pan-Cancer Analysis of Whole Genomes (PCAWG), which aims to analyze non-coding variants in around 2,500 tumor and matched normal whole genomes following the described workflow. This will allow a high-resolution analysis of different cancer types, and will help identify biological pathways targeted by driver mutations.
As mentioned previously, there are three different types of variants: SNVs, indels, and a broad range of SVs, including copy number alterations (homozygous/heterozygous deletion or gain) and translocation breakpoints between chromosomes. When trying to identify variants in non-coding regions of somatic cells for cancer studies, there are several technical considerations for the steps upstream of variant calling which will influence the accuracy of that final step. The GATK Best Practices for whole-genome somatic variant detection, produced by the Broad Institute, help identify tools and measures that can minimize errors in the preliminary steps, ultimately allowing a clearer distinction between true variants and artefacts.
Firstly, tumor samples are often impure due to a mixture of tumor and normal cells, and are not completely diploid due to copy number aberrations. An increased depth of coverage is generally used to overcome these issues – a comparative study of sequencing pipelines found a benefit from increasing sequencing depth to ~100x (two- to three-fold higher than current standards for non-cancer studies), so long as the tumor-to-control coverage ratio remained balanced (Buchhalter et al. 2014).
Furthermore, one of the main procedures which can further influence the quality of variant calling, in particular for non-coding regions, is alignment. Whole genome sequencing technologies usually produce reads of around 150 base pairs, and aligning reads from non-coding regions to the 3.3 billion base pair human reference genome can prove challenging. In addition, half of the human genome is estimated to be composed of repetitive sequences, which create ambiguities in alignment and assembly and can, in turn, produce biases and errors when interpreting results. However, paired-end sequencing has made alignment of repetitive regions substantially easier: if one end matches a non-repetitive region, it can serve as an anchor to identify where the other end, which matches the repetitive sequence, should be mapped. The Burrows-Wheeler Aligner (BWA) is usually the preferred alignment method for human whole-genome data, as it achieves specific alignment much faster than local alignment algorithms by using a prefix-tree index built from the reference genome. When mapping, it is also useful to use Picard tools to mark and remove PCR duplicates.
Additionally, simultaneous indel realignment of both samples needs to be performed as a preliminary step if using a variant caller that, unlike HaplotypeCaller or MuTect2, does not include a haplotype assembly step. Local realignment serves to convert regions with misalignments due to indels into clean reads containing a consensus indel suitable for standard variant discovery approaches.
Finally, Base Quality Score Recalibration (BQSR) is another important data pre-processing step that helps with downstream variant calling. Variant calling algorithms rely heavily on the quality score assigned to individual base calls in each sequence read, which serves to weigh the evidence for or against a variant allele existing at a particular site. Unfortunately, NGS machines such as Illumina produce scores that are subject to various sources of systematic technical error, due to the physics or chemistry of the sequencing reaction or to manufacturing flaws in the equipment. Because of these errors, base quality scores in the read sequence FASTQ files are often over- or under-estimated. BQSR is a process in which machine learning algorithms trained on datasets of known variants are applied to model these errors empirically; the model then serves to adjust the quality scores accordingly. This is accomplished by analysing the covariation among several features of a base – usually the reported quality score, the position within a read, and the preceding and current nucleotide observed by the sequencing machine. For example, it may be observed that for a given run, whenever two C nucleotides are called in a row, the next base called has a 1% higher rate of error. In that case, any base call that comes after CC in a read will have its quality score adjusted downwards accordingly after BQSR. The result of BQSR is more accurate base qualities overall, which in turn improves the accuracy of the subsequent variant calls.
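The covariate-and-adjust idea behind BQSR can be sketched in a few lines of Python. This is an illustrative toy, not the GATK implementation: the real tool models more covariates, masks known-variant sites, and works on full BAM files.

```python
import math
from collections import defaultdict

def recalibration_table(observations):
    """Build an empirical quality table from (reported_q, dinucleotide,
    is_mismatch) tuples observed at sites with no known variation."""
    counts = defaultdict(lambda: [0, 0])  # covariate -> [errors, total]
    for reported_q, dinuc, is_mismatch in observations:
        bucket = counts[(reported_q, dinuc)]
        bucket[0] += int(is_mismatch)
        bucket[1] += 1
    table = {}
    for covariate, (errors, total) in counts.items():
        # +1/+2 smoothing avoids an infinite quality when no errors are seen
        empirical_rate = (errors + 1) / (total + 2)
        table[covariate] = round(-10 * math.log10(empirical_rate))
    return table

# Toy run: bases reported at Q30 after a "CC" context err more often than
# after "AC", so the CC context is recalibrated to a lower quality.
obs = ([(30, "CC", True)] * 4 + [(30, "CC", False)] * 96
       + [(30, "AC", True)] * 1 + [(30, "AC", False)] * 999)
table = recalibration_table(obs)
```

In this toy run the empirical error rate after "CC" (about 5%) yields a recalibrated quality near Q13, well below the reported Q30, which is exactly the kind of systematic bias BQSR is designed to expose.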
The output of all these steps is a Sequence Alignment/Map (SAM) file, or its binary equivalent (BAM), a representation of nucleotide sequence alignments. SAM/BAM files contain, for each read, information such as coverage, base calling quality and alignment status (match/mismatch, insertion, deletion); this information serves as the input to variant calling.
The aim of somatic variant calling is to identify where deviations from the normal tissue sequence occur in a tumor sequence, and to distinguish them from machine artefacts. The output is a Variant Call Format (VCF) file, a standard tab-delimited format in which each base is identified as a variant or as identical to the reference genome.
The two main features of variant callers that need to be maximized for accurate mutation calling, especially in heterogeneous tumor samples and in particular for non-coding regions where coverage may be low, are recall and precision. These are statistical measures of the performance of a binary classification test: sensitivity (also called the true positive rate, or recall) is the fraction of relevant instances that are retrieved, and precision (also called positive predictive value) is the fraction of retrieved instances that are relevant. These two features vary along the genomic sequence, as they depend on several criteria. Sensitivity, which reflects the algorithm's ability to detect all true variants, is a function of sequencing depth and allelic fraction. Precision, on the other hand, represents the ability to dismiss bases that match the normal sample. False positive errors can be caused by overcalling events in the tumor data, due to sequencing errors and inaccurate read placements, or by undercalling true germline events in the matched normal data, due to low sequencing depth in the normal sample.
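These two measures can be written out explicitly. In the sketch below, TP denotes called variants that are real, FP calls that are artefacts or germline leakage, and FN real somatic variants that were missed:

```python
def sensitivity(tp, fn):
    """Fraction of true somatic variants that the caller recovers (recall)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of the caller's variant calls that are true (PPV)."""
    return tp / (tp + fp)

# A caller that recovers 90 of 100 true variants while making 10 false calls
# has sensitivity 0.9 and precision 0.9.
```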
As mentioned earlier, a large part of making variant calling as accurate as possible resides in making good quality base calls, sequencing at a high depth of coverage and performing optimal mapping. It also resides in the intrinsic features of variant calling algorithms that maximize precision and sensitivity.
The variant caller MuTect (Cibulskis et al. 2013) has been validated for somatic point mutations by the ICGC-TCGA Dialogue for Reverse Engineering Assessments and Methods (DREAM) Somatic Mutation Calling Challenge (referred to as “the SMC-DNA Challenge”), a benchmarking competition which will be detailed in the next section. Somatic SNVs are a common mechanism for altering gene function in cancer, but are difficult to identify because they occur at very low frequency in the genome (ranging from 0.1 to 100 mutations per megabase (Mb)), and may occur only in a subset of the sequenced cells because of tumor heterogeneity or contamination. MuTect was therefore developed to allow more sensitive detection of low-allelic-fraction mutations than previous algorithms, while maintaining high precision.
Figure 2 summarizes how MuTect calls somatic mutations with high accuracy. Firstly, the aligned reads in the tumor and normal sequencing data are preprocessed: using the information encoded in the BAM file, MuTect dismisses reads with too many mismatches or very low quality scores, since these introduce more noise than signal. Secondly, MuTect performs a statistical analysis which identifies sites that are likely to carry somatic mutations with high confidence, based on a predefined decision threshold which depends on the expected mutation frequency and the desired false positive rate. The choice of decision threshold controls the tradeoff between specificity and sensitivity, as described by a receiver operating characteristic (ROC) curve. The identified variants are then put through six site-based filters which aim to remove artefacts, refined through a Panel of Normal samples (PON) to screen out remaining false positives caused by rare error modes only detectable in additional samples, and filtered against germline mutations recorded in the Single Nucleotide Polymorphism Database (dbSNP).
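The flavour of the statistical step can be conveyed with a simplified log-odds score. This is a hedged sketch, not MuTect's actual model, which works per read with individual base qualities; here the sequencing error rate is an assumed constant and all reads are treated identically:

```python
import math

def log_odds(alt_reads, ref_reads, f, error_rate=1e-3):
    """Log10 likelihood ratio of 'mutation present at allele fraction f'
    versus 'no mutation, alt reads are sequencing errors' for a pileup."""
    # Probability a read shows the alt base under each model
    p_alt_mut = f * (1 - error_rate) + (1 - f) * error_rate / 3
    p_alt_err = error_rate / 3
    lod = alt_reads * math.log10(p_alt_mut / p_alt_err)
    lod += ref_reads * math.log10((1 - p_alt_mut) / (1 - p_alt_err))
    return lod

# A site with 6 alt reads out of 60 at f = 0.1 scores strongly positive,
# i.e. the pileup is far likelier under the mutation model than under
# sequencing error alone; a site with 0 alt reads scores negative.
```

Raising the decision threshold on such a score trades sensitivity for specificity, which is the tradeoff the ROC curve describes.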
It is worth noting that an upgraded version of MuTect, called MuTect2, has recently been released in beta status; this version combines the germline caller HaplotypeCaller’s local reassembly engine with the original genotyping engine of MuTect. This new method is meant to perform well not only in detecting somatic SNVs, but also indels.
Because there is such a wide range of genomic rearrangements, SV detection is more challenging. Tools like DELLY have scored highly in the SMC-DNA Challenge for SV detection; DELLY delineates structural variants with high specificity and sensitivity thanks to the integration of multiple sources of data: short-insert paired ends, long-range mate pairs and split-read alignments (Rausch et al. 2012). Finally, several algorithms are available for the detection of copy number variation in WGS data from tumor samples – these include QDNAseq, CNAnorm and Patchwork.
Many methods for calling cancer variants have been developed, but their outputs are highly divergent. For example, TCGA lung cancer samples were analysed for SNV detection by four major genome centers, and only 31.0% of SNVs were identified by all four (Kim and Speed 2013). For this reason, there have been efforts to benchmark these many methods, i.e. to compare their performances and error profiles to determine where improvements are called for, to analyze how some algorithms achieve their high performance levels, and to use this information to improve overall performance.
One of the issues with previous ways of assessing pipeline performance was the establishment of gold standards. Indeed, validation datasets were often used to train the algorithms, or, if obtained with an independent technology or from higher-depth sequencing, could exhibit sources of error similar to those of the algorithms being assessed (for example, alignment artefacts). In early 2016, a variant calling pipeline comparison challenge was set up to take advantage of a dataset considered to be highly “truthful”. The precisionFDA Truth Challenge was launched as part of U.S. President Barack Obama’s Precision Medicine Initiative, just as the Genome in a Bottle Consortium (GIAB) was about to release a new reference dataset characterizing high-confidence SNPs, indels and homozygous reference regions in the NA24385 genome, the son in an Ashkenazi Jewish father-mother-son trio. The reads for that genome were processed by the competing mapping and variant calling pipelines – this allowed a precise comparison of pipelines for different genomic areas and variant subtypes, based on the assumption that the GIAB dataset is indeed the gold standard. While performance specifically in non-coding regions was not a criterion for an award, this feature can be explored in the results for the different pipelines.
A year earlier, before that dataset was available, another widely successful benchmarking challenge had been launched: the SMC-DNA Challenge mentioned previously, a crowdsourced benchmark of somatic mutation detection algorithms (Ewing et al. 2015). The organizers developed a tool called BAMSurgeon which added synthetic mutations to existing reads, thus creating 3 different simulations of tumor genomes. These 3 datasets, which varied in complexity, were used to assess the performance of variant callers such as MuTect and VarScan2. The performance metrics used were precision, recall and F-score, the harmonic mean of precision and recall, which reflects overall accuracy. These metrics varied substantially across submissions: for the simplest tumor, IS1, which had a moderate mutation rate, 100% tumor cellularity and no subclonality, recall ranged from 0.559 to 0.994, precision from 0.101 to 0.997 and F-score from 0.046 to 0.975.
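The F-score used to rank submissions is a two-line helper:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; the overall accuracy
    summary used to rank SMC-DNA Challenge submissions."""
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean punishes imbalance between the two components: a submission at precision 0.997 and recall 0.994 scores about 0.996 overall, whereas one at precision 0.101 and recall 0.994 scores only about 0.183, which is why callers near the bottom of the F-score range can still top one of the individual metrics.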
This challenge yielded interesting results: even for the highest performing algorithms, predictions were far more accurate for coding SNVs than for those in UTRs, introns or intergenic regions. This may be symptomatic of the research bias towards exons, reflecting algorithm tuning on exome sequences or gaps in the databases used for germline filtering (such as dbSNP) for non-coding regions. Furthermore, many algorithms exhibited a characteristic false positive pattern, possibly due to the introduction of deamination artefacts during library preparation. Finally, it was also found that an ensemble of several methods outperforms any single tool, suggesting that mutation calls should be made by aggregating multiple algorithms. These results reveal characteristics associated with accuracy that could be exploited in future algorithm development, in particular for non-coding regions.
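The ensemble finding can be sketched as a simple majority vote. This assumes each caller's VCF has been reduced to a set of (chromosome, position, ref, alt) tuples; the caller names and coordinates below are purely illustrative:

```python
from collections import Counter

def ensemble_calls(call_sets, min_votes=2):
    """Keep variants reported by at least min_votes of the supplied callers."""
    votes = Counter(v for calls in call_sets for v in set(calls))
    return {variant for variant, n in votes.items() if n >= min_votes}

# Hypothetical outputs from three callers on the same tumor/normal pair
caller_a = {("chr5", 1295228, "G", "A"), ("chr7", 55242465, "G", "T")}
caller_b = {("chr5", 1295228, "G", "A"), ("chr1", 115258747, "C", "T")}
caller_c = {("chr5", 1295228, "G", "A"), ("chr7", 55242465, "G", "T")}

consensus = ensemble_calls([caller_a, caller_b, caller_c])
# The variant seen by all three callers survives; the call private to
# caller_b, which is more likely an artefact, is dropped.
```

More sophisticated aggregators weight callers by their known error profiles rather than counting votes equally, but even this naive scheme tends to raise precision at a modest cost in recall.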
Following a systematic screening of alterations in a tumor cell and filtering of germline mutations, the tens of thousands of somatic variants in the VCF require processing to distinguish the few “driver” or highly damaging mutations which could be the underlying force behind the development of tumors. Prioritization is an essential preliminary step to guide further in-depth analysis to understand the mechanisms underlying oncogenesis.
Several computational methods have been developed to predict drivers from exome variant data. Algorithms such as PolyPhen2 (short for Polymorphism Phenotyping) (Adzhubei et al. 2010) aim to predict the functional impact of missense mutations in coding genes. PolyPhen2 was trained on 3,155 damaging alleles annotated as causing human Mendelian diseases and affecting protein stability or function, together with 6,321 differences between human proteins and their closely related mammalian homologs, assumed to be nondamaging. This enables a Bayesian classifier to characterize variants as benign, possibly damaging or probably damaging based on evolutionary conservation, protein structural information and the physicochemical properties of amino acid changes. A similar prioritization method is SIFT (Sorting Intolerant from Tolerant) (Kumar et al. 2009). For a given protein sequence, SIFT compiles a dataset of functionally related protein sequences, builds an alignment of these homologous sequences and calculates normalized probabilities for all 20 possible amino acids at each position. SIFT predicts a substitution to affect protein function if the scaled probability, also called the SIFT score, lies below a certain threshold value. Generally, highly conserved positions are more intolerant to substitutions. However, there has been criticism of using evolutionary sequence conservation as an indicator of deleteriousness (MacArthur et al. 2014), because of statistical factors (namely the accuracy of the underlying sequence alignments) and biological factors (some sites may remain highly conserved despite not being subject to strong selective pressure). This may explain why, in a comparative study of SIFT and PolyPhen for the detection of loss-of-function or gain-of-function mutations (Flanagan et al. 2010), while both exhibited a reasonably high sensitivity (69% and 68%, respectively), the two methods had a low specificity (13% and 16%).
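The core of the SIFT scoring scheme can be illustrated with a toy calculation. This sketch omits the sequence weighting and pseudocounts the real tool applies: the score for a substitution is the frequency of the substituted residue in the alignment column, normalized by the frequency of the most common residue there.

```python
from collections import Counter

def sift_like_score(column, substitution):
    """Normalized probability of `substitution` at one alignment position.
    `column` is the list of residues observed across homologs."""
    counts = Counter(column)
    most_common_count = counts.most_common(1)[0][1]
    return counts.get(substitution, 0) / most_common_count

# A position occupied by leucine in 9 of 10 homologs is intolerant:
column = list("LLLLLLLLLV")
# Substituting to proline (never observed) scores 0.0, well below the
# conventional 0.05 cutoff, and would be predicted damaging; substituting
# to valine (observed once) scores 1/9 and would be tolerated.
```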
Another approach is based on prioritizing sequences that are mutated more frequently than expected from the background mutation rate (a characteristic called recurrence) across a cohort of tumor samples. However, there are also challenges in using this method as algorithms need to account for mutational heterogeneity across the genome. The local mutation rate can indeed be variable depending on factors such as replication timing, and an added complication is that tumor cells often have defects in DNA repair processes. In addition, measurements of mutation frequency are likely to favour early driver genes over genes that may be mutated later in tumor development.
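The statistical idea behind recurrence testing, and why it is fragile, can be made concrete with a Poisson sketch. The uniform background rate `mu` assumed below is precisely the simplification that the mutational heterogeneity described above violates:

```python
import math

def recurrence_pvalue(observed, mu, length, n_samples):
    """Upper-tail Poisson p-value for seeing `observed` mutations in an
    element of `length` bp across `n_samples` tumors, given a uniform
    background rate `mu` (mutations per base per sample)."""
    lam = mu * length * n_samples
    # P(X >= observed) = 1 - sum_{k < observed} e^-lam * lam^k / k!
    cdf = sum(math.exp(-lam) * lam**k / math.factorial(k)
              for k in range(observed))
    return 1 - cdf

# 8 mutations in a 500 bp promoter across 100 tumors, at a background of
# 3e-6 per bp per sample (~3 mutations/Mb), looks wildly significant...
p = recurrence_pvalue(8, 3e-6, 500, 100)
# ...but if the local rate is actually 10x the genome average (replication
# timing, repair deficiency), the same count is far less surprising.
```

Real driver-detection methods therefore estimate a local, covariate-adjusted background rate instead of a genome-wide constant.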
A method called OncodriveFM (Gonzalez-Perez and Lopez-Bigas 2012) combines the two aforementioned methods to overcome some of their flaws: it classifies as candidate driver genes those that exhibit a bias toward the accumulation of variants with high functional impact, based on scores from SIFT and PolyPhen. From figure 3, which depicts the basic principle of OncodriveFM, it is apparent that this method overcomes the issue of evaluating the background mutation rate. The application of the method on three tumor datasets revealed that most of the highly recurrent and well-known cancer genes do exhibit a clear FM bias, thus validating the approach.
Prioritizing variants mapping to non-coding regions is challenging because there are many more non-coding mutations than coding ones, since the exome represents such a small part of the genome. It is thus necessary to have high performing computational tools that will identify the key driver mutations within tens of thousands of somatic variants. Functional annotation of non-coding regions has recently been completed thanks to high-throughput projects such as ENCODE and Roadmap Epigenomics (Roadmap Epigenomics Consortium, 2015), which makes variants mapping to these regions interpretable. These projects have provided researchers with a genome-wide map of regulatory elements, transcription factor binding sites, histone modification patterns, DNase I hypersensitive sites, as well as RNA-seq and replication timing data across many cell lines.
These annotations of human functional elements, of non-coding regions in particular, have been widely integrated into computational tools for the detection of genome-wide variants implicated in cancer. Kircher et al. (2014) developed a machine-learning algorithm called Combined Annotation-Dependent Depletion (CADD), which contrasts the annotations of fixed alleles in humans with those of simulated variants. The novelty of CADD compared to other algorithms at the time was that it objectively weighted and integrated diverse information, and worked across functional categories, effect sizes and genetic architectures, while still outputting a single metric in the form of a C-score ranging from 0 to 99. Integrating data from the ENCODE project, annotation from the Ensembl Variant Effect Predictor (VEP), and information from UCSC genome browser tracks, it used conservation metrics, functional genomic data, transcript information and protein-level scores, and thus outperformed all scoring tools that exploited only one information type. A similar tool called FATHMM-MKL (Shihab et al. 2015) has since been shown to outperform CADD for predicting the functional consequences of non-coding variants, while retaining comparable performance for coding variants. FATHMM-MKL was trained on germline mutations from the Human Gene Mutation Database (HGMD), with the control dataset constructed using SNVs from the 1000 Genomes Project. The functional score for an individual mutation from FATHMM-MKL takes the form of a single p-value, ranging from 0 to 1. This algorithm is currently used for the Catalogue Of Somatic Mutations In Cancer (COSMIC) database, the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. In the database, mutations with scores over 0.5 are classified as “deleterious”, and above 0.7, “pathogenic”.
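Applying the COSMIC cutoffs quoted above is a simple thresholding step; the score itself must of course come from running FATHMM-MKL on the variant, and the "neutral" label for sub-threshold scores is an assumption of this sketch rather than COSMIC terminology:

```python
def cosmic_classification(fathmm_score):
    """Map a FATHMM-MKL score (0-1) to the COSMIC labels described above:
    over 0.5 is 'deleterious', above 0.7 is 'pathogenic'."""
    if fathmm_score > 0.7:
        return "pathogenic"
    if fathmm_score > 0.5:
        return "deleterious"
    return "neutral"  # label for sub-threshold scores is assumed here
```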
Recurrence is also used for non-coding regions; however, the regulatory region of TERT is the only one that has been detected to be mutated at a high rate (in 79% of sarcomas). As mentioned previously, variations in mutation rates across the genome render this approach somewhat inaccurate, and this is particularly the case for non-coding regions. No direct correlation between recurrence in a regulatory region and gene expression has been observed – one explanation could be that the mutations are not functional, and are only recurrent because of underlying mutational signatures or mutation rate covariates. Specifically, it was recently uncovered that mutation density at gene promoters can be linked to promoter activity and differential nucleotide excision repair (Perera 2016). To overcome the fact that recurrence in non-coding regions and gene expression are not clearly linked, functional impact scores from CADD have been integrated into the assessment of recurrent regulatory variants by the tool OncodriveFML (Mularoni et al. 2016), an improved, genome-wide version of OncodriveFM. Despite the absence of any gold standard for the detection of non-coding driver elements, this method was nevertheless shown to have a lower rate of false positives than similar approaches based on positive selection.
Other methods, such as DeepSEA (Zhou and Troyanskaya 2015), aim to identify the functional effect of mutations de novo from the sequence. These methods use machine-learning models trained on large-scale functional genomics data (chromatin profiles, in the case of DeepSEA) and aim to predict the effect of cis-regulatory mutations at single-nucleotide resolution, for example by computing whether they lead to the gain or loss of TF-binding motifs.
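The ref-versus-alt logic behind motif gain/loss scoring can be illustrated with a toy position weight matrix. The PWM values below are invented purely for illustration; DeepSEA itself replaces this hand-built scoring with a deep network trained on chromatin data, but the delta computation between reference and alternate alleles is analogous.

```python
# Hypothetical log-odds position weight matrix for a 4-bp TF motif
# (columns = motif positions, rows = bases).  Values are illustrative.
PWM = {
    "A": [1.2, -0.5, -1.0, 0.8],
    "C": [-0.7, 1.1, -0.3, -0.9],
    "G": [-0.4, -0.8, 1.3, -0.2],
    "T": [-0.9, -0.6, -0.5, 0.6],
}

def motif_score(seq: str) -> float:
    """Best PWM score over all windows of the sequence."""
    w = len(PWM["A"])
    return max(sum(PWM[b][i] for i, b in enumerate(seq[s:s + w]))
               for s in range(len(seq) - w + 1))

def binding_delta(ref_seq: str, alt_seq: str) -> float:
    """Positive delta suggests a gained binding site, negative a loss."""
    return motif_score(alt_seq) - motif_score(ref_seq)
```

A mutation that creates the motif consensus in the alternate sequence yields a positive delta (predicted gain of binding), while one that destroys it yields a negative delta (predicted loss).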
Finally, FunSeq2 (Fu et al. 2014) is a widely used framework that integrates large-scale genomics and cancer resources to prioritize non-coding SNVs and indels. As shown in Figure 4, its pipeline uses a weighted scoring system that combines many of the features described above, for instance loss- and gain-of-function events in transcription-factor binding sites and per-element recurrence across samples. It also considers inter- and intra-species conservation and enhancer-gene linkages, and, importantly, incorporates network topology analysis, which provides information about the connectivity of the target genes of non-coding variants. Mutations in the regulatory regions of genes that are highly connected in protein-protein interaction and regulatory networks have been suggested to have a higher functional impact than those affecting peripheral genes in the network.
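A weighted additive scoring scheme of this kind can be sketched as follows. The feature names and weights are hypothetical placeholders; FunSeq2 itself derives its weights from how informative each feature is across samples rather than from fixed constants.

```python
# Hypothetical feature weights -- illustrative only, not FunSeq2's values.
WEIGHTS = {
    "tf_motif_break": 1.5,   # loss-of-function in a TF-binding site
    "tf_motif_gain": 1.0,
    "conserved": 1.2,        # inter-/intra-species conservation
    "enhancer_linked": 0.8,  # linked to a target gene via enhancer-gene pairs
    "hub_gene_target": 1.4,  # target gene is highly connected in networks
    "recurrent": 2.0,        # mutated in multiple samples
}

def prioritization_score(features: set) -> float:
    """Sum the weights of the annotations a variant carries."""
    return sum(WEIGHTS[f] for f in features if f in WEIGHTS)

def rank_variants(variant_features: dict) -> list:
    """Return variant IDs ordered from highest to lowest score."""
    return sorted(variant_features,
                  key=lambda v: prioritization_score(variant_features[v]),
                  reverse=True)
```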
It is worth noting that, as presented, there are many pathogenicity scoring tools; however, they are optimized either for exons or for the whole genome, which generally worsens their performance on specific feature classes compared with specialist predictors. The Prabakaran lab is currently building a variant caller and a pathogenicity predictor for short open reading frames (sORFs), sequences of sense codons bounded by start and stop codons, typically between 2 and 100 codons in length. Although sORFs are potentially translatable into short peptides (sPEPs), the majority are not thought to be translated, and these regions have therefore been somewhat disregarded as non-functional. By training a Random Forest algorithm on a broad range of sORF-specific features, they hope to produce a pathogenicity classifier with better precision and sensitivity than other whole-genome pathogenicity prediction tools. Similarly, Li and Lv (2016) built a linear regression model to prioritize specifically potential lung cancer-associated long non-coding RNAs (lncRNAs), non-coding elements that present very specific features.
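The training setup for such a classifier might look like the following scikit-learn sketch. The feature set and the synthetic labels are entirely hypothetical stand-ins for the lab's actual sORF-specific features; the point is only to show the Random Forest workflow.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Hypothetical per-variant features: conservation score, codon position
# within the sORF, predicted change in peptide hydrophobicity, recurrence.
X = rng.random((n, 4))
# Synthetic labels loosely tied to the conservation column so that the
# model has real signal to learn from.
y = (X[:, 0] + 0.2 * rng.standard_normal(n) > 0.6).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
# Per-variant pathogenicity probability for the first five variants.
proba = clf.predict_proba(X[:5])[:, 1]
```

In practice the labels would come from curated pathogenic and benign variant sets, and performance would be assessed on held-out data rather than the training set.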
Since COSMIC was launched in 2004, detailing four cancer genes, cancer genomics has grown enormously: as of April 2017, COSMIC holds full literature curations of 616 genes, while the number of validated non-coding driver elements remains much smaller. This prompts the question: do only a very small number of non-coding variants actually play an important role in the development of cancer, or have many of them simply gone undetected?
Khurana et al. (2017) present a hypothesis that supports the view that many non-coding driver mutations remain to be found, and that would explain the difficulty of transferring variant prioritization strategies from protein-coding to non-coding regions. They suggest that many non-coding mutations, for example a change in a TFBS that slightly alters its affinity for a TF, may make a smaller individual contribution to tumorigenesis than mutations that completely alter the function of a protein. If this "mini-driver" model of polygenic cancer evolution (Castro-Giner et al. 2015) is correct, some non-coding mutations may be dismissed as passengers by current prioritization tools designed for coding regions, which only detect "major drivers", i.e. highly pathogenic variants. This could also explain why, as mentioned earlier, recurrence is hard to detect for elements in regulatory regions. Work by Bailey et al. (2016) on the combinatorial impact of alterations in the regulatory sequences of the ESR1 gene, involved in breast cancer, suggests that selection does not operate on a specific variant (as has been observed for many coding drivers); rather, because many enhancers control the expression of a single gene, selection acts on a phenotypic outcome. This highlights the importance for non-coding variant prioritization methods in cancer of being sensitive to small, cumulative effects, and especially of integrating a network approach. Moreover, some regulatory elements may be tissue-specific, so the impact of variants may differ accordingly; the functional impact of a mutation may also vary between developmental stages or even between individuals. As more WGS data become available and computational prioritization methods improve, many new causative non-coding mutations are expected to be discovered.
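The selection-on-phenotype argument can be made concrete with a back-of-the-envelope calculation, under the assumption (hypothetical numbers) that a gene is regulated by ten enhancers, each mutated independently in 3% of tumors: no single element looks recurrent, yet roughly a quarter of tumors carry a hit somewhere in the gene's enhancer set.

```python
def per_element_vs_combined(n_enhancers: int, per_element_freq: float):
    """Return (per-element mutation frequency, probability that at least
    one of the gene's enhancers is mutated in a given tumor), assuming
    independent mutation of each element."""
    combined = 1.0 - (1.0 - per_element_freq) ** n_enhancers
    return per_element_freq, combined

# Each enhancer is mutated in only 3% of tumors, but the chance that
# the gene's regulatory program is hit somewhere is much higher.
elem, gene = per_element_vs_combined(10, 0.03)
```

An element-level recurrence test sees only the 3% signal; a gene-level (network-aware) test sees the aggregated ~26% signal.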
This review has shown that while the ENCODE project predicts an important role for non-coding variants in driving complex diseases, it remains difficult to interpret whole-genome data and to identify highly damaging mutations with great certainty. Enhancing base-calling quality, improving variant calling (especially of indels and structural variants) and understanding the cumulative effects of co-expression of multiple genes, as well as the interplay between germline and somatic variants, will all be crucial to deciphering, through association studies, the mechanisms underlying tumorigenesis. The integration of more epigenomic data from different cancer samples will help to further clarify the impact of non-coding variants on the rewiring of the regulatory network as cells transform from a normal to a malignant state. Furthermore, deep sequencing of tumor samples will allow the discovery of variants in different sub-clones, revealing a dynamic view of clonal evolution patterns and mutation events and thereby helping to predict the development of drug resistance.
Moreover, the potential for functional validation of the prioritized non-coding variants has been improved with the development of the CRISPR-Cas9 technique, which enables the reproduction of specific mutations in model organisms. Even though the biological validation for cancer-causing effects can be costly because of the model systems used and the amount of time it takes to achieve a biologically relevant readout, systematic approaches using CRISPR-Cas9 may accelerate this process (Khurana et al. 2016).
As the roles of non-coding variants in cancer development become better understood and easier to validate, we can be hopeful that medical practice will develop in parallel with new knowledge, providing further strides towards personalised medicine. It is already becoming common in oncology diagnosis and therapy to perform whole-exome sequencing of individual patients. This enables the identification of the disrupted gene(s) driving tumor development, helping to delineate precise subtypes of cancer that present the same phenotype but have different underlying causes and thus respond differently to treatment. In addition to informing pharmacogenomics, identifying the genetic causes of different types of cancer paves the way for gene therapy, which has had limited success thus far but is expected to grow with the improvement of delivery vectors. These medical applications for protein-disrupting variants discovered through association studies may in the future extend to non-coding variants. However, as the clinical relevance of 98% of the genome is still uncertain, the roughly 50-fold increase in sequencing required to move from the exome to the whole genome remains prohibitively expensive for precision oncology. Maximizing the cost effectiveness of genomic coverage, together with strong evidence for the involvement of non-coding variants in cancer, will be crucial before patient whole-genome sequencing becomes common clinical practice.
Adzhubei I.A., Schmidt S., Peshkin L., Ramensky V.E., Gerasimova A., Bork P., Kondrashov A.S. and Sunyaev S.R. (2010). A method and server for predicting damaging missense mutations. Nat. Methods 7, 248-9.
Broad Institute, GATK Best Practices. https://software.broadinstitute.org/gatk/best-practices/. Last accessed 20 April 2017.
Broad Institute, GATK Forums, Base Quality Score Recalibration. http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr. Last accessed 20 April 2017.
Buchhalter I., Hutter B., Alioto T.S., Beck T.A., Boutros P.C., Brors B., Butler A.P., Chotewutmontri S., Denroche R.E., Derdak S. et al. (2014). A comprehensive multicenter comparison of whole genome sequencing pipelines using a uniform tumor-normal sample pair. bioRxiv 013177.
Castro-Giner F., Ratcliffe P. and Tomlinson I. (2015). The mini-driver model of polygenic cancer evolution. Nat Rev Cancer. 15, 680-5.
Catalogue of Somatic Mutations in Cancer (COSMIC). http://cancer.sanger.ac.uk/cosmic/. Last accessed 20 April 2017.
Cibulskis K., Lawrence M.S., Carter S.L., Sivachenko A., Jaffe D., Sougnez C., Gabriel S., Meyerson M., Lander E.S. and Getz G. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat. Biotechnol. 31, 213–219.
Cuykendall T.N., Rubin M.A. and Khurana E. (2017). Non-coding genetic variation in cancer. Current Opinion in Systems Biology 1, 9–15.
Doolittle W.F. (2013). Is junk DNA bunk? A critique of ENCODE. Proc. Natl. Acad. Sci. U.S.A. 110, 5294–5300.
Ewing A.D., Houlahan K.E., Hu Y., Ellrott K., Caloian C., Yamaguchi T.N., Bare J.C., P’ng C., Waggott D., Sabelnykova V.Y. et al. (2015). Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature Methods 12, 623–630.
Flanagan S.E., Patch A. and Ellard S. (2010). Using SIFT and PolyPhen to Predict Loss-of-Function and Gain-of-Function Mutations. Genetic Testing and Molecular Biomarkers 14, 533-537.
Fu Y., Liu Z., Lou S., Bedford J., J. X. Mu, Yip K.Y., Khurana E. and Gerstein M. (2014). FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biology 15, 480.
Gonzalez-Perez A. and Lopez-Bigas N. (2012). Functional impact bias reveals cancer drivers. Nucleic Acids Res. 40, 169.
Khurana E., Fu Y., Chakravarty D., Demichelis F., Rubin M.A. and Gerstein M. (2016). Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93-108.
Khurana E., Fu Y., Colonna V., Mu X.J., Kang H.M., Lappalainen T., Sboner A., Lochovsky L., Chen J., Harmanci A. et al. (2013). Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587.
Kim S.Y. and Speed T.P. (2013). Comparing somatic mutation-callers: beyond Venn diagrams. BMC Bioinformatics 14, 189.
Kircher M., Witten D.M., Jain P., O’Roak B.J., Cooper G.M. and Shendure J. (2014). A general framework for estimating the relative pathogenicity of human genetic variants. Nature Genetics 46, 310–315.
Knudson A.G. Jr. (1971). Mutation and cancer: statistical study of retinoblastoma. Proc. Natl. Acad. Sci. U.S.A. 68, 820-3.
Kumar P., Henikoff S. and Ng P.C. (2009). Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat. Protoc. 4, 1073-81.
Li H. and Lv X. (2016). Functional annotation of noncoding variants and prioritization of cancer-associated lncRNAs in lung cancer. Oncol. Lett. 12, 222-230.
Lim L.P., Lau N.C., Garrett-Engele P., Grimson A., Schelter J.M., Castle J., Bartel D.P., Linsley P.S. and Johnson J.M. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 433, 769-773.
MacArthur D.G., Manolio T.A., Dimmock D.P., Rehm H.L., Shendure J., Abecasis G.R., Adams D.R., Altman R.B., Antonarakis S.E., Ashley E.A. et al. (2014). Guidelines for investigating causality of sequence variants in human disease. Nature 508, 469–476.
Patrushev L.I. and Kovalenko T.F. (2014). Functions of noncoding sequences in mammalian genomes. Biochemistry (Mosc.) 79, 1442-69.
Perera D., Poulos R.C., Shah A., Beck D., Pimanda J.E. and Wong J.W. (2016). Differential DNA repair underlies mutation hotspots at active promoters in cancer genomes. Nature 532, 259-63.
Prabakaran S., Hemberg M., Chauhan R., Winter D., Tweedie-Cullen R.Y., Dittrich C., Hong E., Gunawardena J., Steen H., Kreiman G. and Steen J.A. (2014). Quantitative profiling of peptides from RNAs classified as noncoding. Nature Communications 5, 5429.
Precision FDA Challenge. https://precision.fda.gov/challenges/truth. Last accessed 20 April 2017.
Rausch T., Zichner T., Schlattl A., Stütz A.M., Benes V. and Korbel J.O. (2012). DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339.
Roadmap Epigenomics Consortium, Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J. et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330.
Shihab H.A., Rogers M.F., Gough J., Mort M., Cooper D.N., Day I.N., Gaunt T.R. and Campbell C. (2015). An integrative approach to predicting the functional effects of non-coding and coding sequence variation. Bioinformatics 31, 1536-43.
The ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.
Zhou J. and Troyanskaya O.G. (2015). Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods. 12, 931-4.