RNA-Seq has become increasingly popular in transcriptome profiling. The major challenge in RNA-Seq data analysis is the accurate mapping of junction reads to their genomic origins. To detect splicing sites in short reads, many RNA-Seq aligners use reference transcriptome to inform placement of junction reads. However, no systematic evaluation has been performed to assess or quantify the benefits of incorporating reference transcriptome in mapping RNA-Seq reads. In this paper, we have studied the impact of reference transcriptome on mapping RNA-Seq reads, especially on junction ones. The same dataset were analysed with and without RefGene transcriptome, respectively. Then a Perl script was developed to analyse and compare the mapping results. It was found that about 50–55% junction reads can be mapped to the same genomic regions regardless of the usage of RefGene model. More than one-third of reads fail to be mapped without the help of a reference transcriptome. For “Alternatively” mapped reads, i.e., those reads mapped differently with and without RefGene model, the mappings without RefGene model are usually worse than their corresponding alignments with RefGene model. For junction reads that span more than two exons, it is less likely to align them correctly without the assistance of reference transcriptome. As the sequencing technology evolves, the read length is becoming longer and longer. When reads become longer, they are more likely to span multiple exons, and thus the mapping of long junction reads is actually becoming more and more challenging without the assistance of reference transcriptome. Therefore, the advantages of using reference transcriptome in the mapping demonstrated in this study are becoming more evident for longer reads. In addition, the effect of the completeness of reference transcriptome on mapping of RNA-Seq reads is discussed.
Citation: Zhao S (2014) Assessment of the Impact of Using a Reference Transcriptome in Mapping Short RNA-Seq Reads. PLoS ONE 9(7): e101374. https://doi.org/10.1371/journal.pone.0101374
Editor: Christophe Antoniewski, CNRS UMR7622 & University Paris 6 Pierre-et-Marie-Curie, France
Received: March 5, 2014; Accepted: June 6, 2014; Published: July 3, 2014
Copyright: © 2014 Shanrong Zhao. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The author confirms that all data underlying the findings are fully available without restriction. All data accessible from ArrayExpress (accession # is E-MTAB-513).
Funding: The author has no support or funding to report.
Competing interests: Author Shanrong Zhao is employed by Janssen Research & Development, LLC. There are no patents, products in development or marketed products to declare. This does not alter the author’s adherence to all the PLOS ONE policies on sharing data and materials.
In recent years, RNA-Seq has become a popular and powerful approach for transcriptome profiling –. RNA-Seq not only has considerable advantages for examining transcriptome fine structure–for example, in the detection of novel transcripts, allele-specific expression, and alternative splicing–but also provides a far more precise measurement of levels of transcripts than that of other methods such as microarray –. Previously we had performed a side by side comparison of RNA-Seq and microarray in investigating T cell activation, and demonstrated that RNA-Seq is superior in detecting low abundance transcripts, differentiating biologically critical isoforms, and allowing the identification of genetic variants . In addition, RNA-Seq also has a much broader dynamic range than microarray, which allows for the detection of more differentially expressed genes with higher fold-change. Furthermore, RNA-Seq avoids technical issues in microarray related to probe performance such as cross-hybridization, limited detection range of individual probes, and nonspecific hybridization. And thus, RNA-Seq delivers unbiased and unparalleled information about the transcriptome and gene expression . Currently, RNA-Seq is becoming an attractive approach in the profiling of gene expression and in evaluating differential expression .
However, RNA-Seq poses novel algorithmic and logistical challenges for data analysis and storage –. Despite the fact that many computational methods have been developed for alignment of reads, quantification of gene and/or transcripts, and identification of differentially expressed genes, there is great variability in the maturity of these available computational tools. In recent years, a large number of mapping algorithms – have been developed, and most recently, Engström et al  had performed a comprehensive assessment of the performance of 26 mapping protocols based on 11 programs and pipelines available, and found major performance differences between methods on numerous benchmarks, including alignment yield, basewise accuracy, mismatch and gap placement, exon splicing discovery and suitability of alignments for transcript reconstruction. They demonstrate that choice of alignment software is critical for accurate interpretation of RNA-Seq data, and identify aspects of the spliced-alignment problem in need of further attention.
The first step and a major challenge in RNA-Seq data analysis is the accurate mapping of sequencing reads to their genomic origins including the identification of splicing events. However, accurate alignment of high-throughput RNA-Seq data is a challenging and yet unsolved problem because of exon-eon spanning junction reads, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Two key tasks make these analyses computationally intensive. The first task is an accurate alignment of reads that contain mismatches, insertions and deletions caused by genomic variations and sequencing errors. The second task involves mapping junction reads that span two or more exons. Although the first task is shared with DNA resequencing efforts, the second task is specific and crucial to the RNA-seq. These alignment challenges are further confounded by the presence of multiple copies of identical or related genomic sequences, making precise mapping difficult.
To detect splicing sites in short reads, there are three approaches published so far. The tools such as MapSplice  and TopHat  implement a two-step approach in which initial read alignments are analyzed to discover splicing sites; these splicings are then used to guide final alignments of those ‘initially unmapped reads’. However, this approach requires exons to have sufficiently high expression and will miss splicing events that are spanned by individual reads at a low level. The second solution for detecting splicing in short reads has been to align them to a reference transcriptome, possibly augmented with artificially constructed exon–exon segments. However, such an approach will identify only known or predicted combinations of exons, but not unexpected exon pairs that occur through exon skipping, cryptic splicing or gene fusions. In practice, these two approaches can be used in conjunction to improve placement of short sequence reads. Most recently, the “seed-and-vote” approach  is introduced. The new strategy uses a number of overlapping seeds from each read, called subreads. Instead of trying to pick the best seed, the strategy allows all the seeds to vote on the optimal location for the read. The algorithm then uses more conventional alignment algorithms to fill in detailed mismatch and indel information between the subreads that make up the winning voting block. The strategy is fast because the overall genomic location has already been chosen before the detailed alignment is done. It is also sensitive because no individual subread is required to map exactly, nor are individual subreads constrained to map close by other subreads.
Nowadays, many RNA-Seq aligners use reference transcriptome to inform spliced-read placements, including GSNAP , OSA , STAR , TopHat  and etc. In fact, this has become a common practice in RNA-Seq data analysis. However, no systematic evaluation has been performed to assess and/or quantify the benefits of incorporating reference transcriptome in mapping RNA-Seq reads. In this paper, we want to fill this gap. The same RNA-Seq dataset were first analysed with and without reference transcriptome, respectively, and then compared the mapping results and demonstrated the benefits for reference transcriptome in RNA-Seq data analysis.
The Human Body Map 2.0 Project by Illumina generated RNA-Seq data for 16 different human tissues and can be accessible from ArrayExpress (accession # is E-MTAB-513). We used the 75 bp single read data from heart, liver, lung and kidney. In order to investigate the impact of sequencing depth on results, for each selected tissue, we sampled 10%, 30% and 60% of reads from the corresponding raw data, and analysed the subset data in the same protocol. The subsets are denoted as s10, s30, and s60, respectively (Table 1).
The RefGene annotation files in GTF format was downloaded from UCSC genome browser on December 14 2012. First go to UCSC genome browser website (genome.ucsc.edu) and click “Table” menu to bring up “Table Browser” interface. Then set (1) genome to “human”, (2) assembly to “hg19”, (3) group to “Gene and Gene Predictions”, (4) track to “RefSeq Genes”, and (5) output format to “GTF – gene transfer format”. And then click “get output” button and save the text file. The RefGene model is used in our analysis.
Primary sequencing reads were first mapped to RefGene transcriptome and the human reference genome hg19 using OSA (Omicsoft Sequence Aligner, http://www.omicsoft.com/osa) , a super-fast and accurate alignment tool for RNA-Seq data. Benchmarked with existing methods such as Tophat and others, OSA improves mapping speed 4–10 fold with better sensitivity and less false positives. We have chosen OSA in Stormbow, a cloud-based pipeline we developed for large-scale RNA-Seq data analysis. OSA white paper (http://www.omicsoft.com/downloads/whitepaper/OmicsoftAligner.pdf) has more technical details on its implementation. For novel splicing sites (i.e. those junction sites not included in transcriptome), OSA uses Seed-Extend approach to discover junctions.
There are a total of three independent runs with different parameter settings: (1) RefGene/Unique run–aligning reads to RefGene transcriptome first and then to the reference genome, and only uniquely mapped reads are reported; (b) None/Unique run– reads are mapped to genome only without using reference transcriptome and only uniquely mapped read are kept. In this run; (c) RefGene/Multiple run–mapping reads to RefGene transcriptome and then to the genome, but reporting all locations if a read can be mapped equally well to multiple genomic regions. Next, we compared the difference of alignments for each read with and without RefGene transcriptome. All alignments were first exported into SAM text files , and then a Perl script was written to compare the alignments. Our comparison focused on “RefGene/Unique” versus “None/Unique”. For those reads mapped in “RefGene/Unique”, they can be divided into three categories according to their mapping results in “None/Unique” run. The first category is denoted as “Identical” – the mapping is exactly the same. The second is “Alternative” – still mapped but differently, either to a different genomic region or with different splicing sites. The last category is “Unmapped”, in which a mapped read becomes unmapped without the help of reference transcriptome.
We are particularly interested in those junction reads that spans two or more exons. Based upon the CIGAR  string for each mapped read in SAM files, we identified junction reads from each category, and calculated their statistical summaries. Additional analysis was performed on “Alternative” and “Unmapped” junction reads to characterize their splicing patterns in terms of the overlaps with exons. Furthermore, we investigated the key reasons for “Alternative” and “Unmapped” junction reads. At last, the “unique” versus “multiple” mapping mode was compared and explored.
The overall mapping summary for different run settings
The mapping summary with different parameter settings was reported in Table 1. The column one corresponds to the total number of sequence reads. Compared “None/Unique” with “RefGene/Unique” run, an average of 6∼9% reads fail to be aligned. Clearly, with the help of RefGene model, more reads can be mapped, especially for junction reads as we demonstrated below. With respect to unique-mapping mode, roughly 5–10% more reads can be mapped in multiple-mapping mode, as we can see from the comparison of “RefGene/Multiple” with “RefGene/Unique”. We do not see much difference in mapping summary when a same sample is sequenced at varying depth in Table 1. Take the heart sample as an example; its uniquely mapped percentage in “RefGene/Unique” run is 80.76%, nearly identical to the results from its subsets in which 10%, 30% or 60% reads are randomly sampled, respectively.
As described above, all mapped reads in “RefGene/Unique” run can be divided into three categories with reference to their corresponding mappings in “None/Unique” run. After comparing the mapping results, the number of “Identical”, “Alternative” and “Unmapped” reads and their corresponding percentages are detailed in Table 2. While the majority of reads are not affected, 7.1 to 10.7% of mapped reads fail to be aligned without the help of RefGene model. Additionally, a small portion (ranging from 1.76 to 3.05%) of reads remain mapped but differently. As we demonstrate below, the alternative mapping is usually worse than its original mapping in “RefGene/Unique” run. It is noted that the impact of RefGene model on read mapping is sample dependant. As gene expression is very tissue specific, as a result, it is expected that each sample has its own unique gene expression profile. For the same sample, the effect of sequencing depth on the difference between “RefGene/Unique” and “None/Unique” runs is minimal or can be ignored according to Table 2.
The summary in Table 3 is similar to Table 2 but with a focus on junction reads only. According to Table 3, more than one third of junction reads cannot be aligned, and 10–15% reads are mapped in an alternative way when aligning reads without RefGene model. The “Identical” junction reads are only a little more than 50%. Once again, we do not see much effect of sequencing depth. Compared Table 2 with Table 3, it is clear that junction reads are more vulnerable to mapping failure without the help of a gene model. In other words, reference transcriptome mainly impacts spliced-read placements.
To highlight the impact of reference transcriptome on the mapping of junction reads, we calculated the percentages of junction reads over the total reads in each category (Figure 1) based upon the summary in Table 2 and Table 3. Overall, junction reads account for about 20% of all mapped reads in “RefGene/Unique” run. However, the percentages jump up to nearly 100% for “Alternative” and 65–80% for “Unmapped” reads (Figure 1), respectively. This means nearly all “Alternatively” mapped and the majority of “Unmapped” in Table 2 are junction reads. It is confirmed from Figure 1 and Table 3 that a reference transcriptome affects mainly the mapping of junction reads.
Figure 1. The percentages of junction reads over the total read in each category.
Overall, about 20% mapped reads are junction ones. However, the percentages are nearly 100% and 65–80% for “Alternative” and “Unmapped” reads, respectively. RefGene transcriptome mainly influences the mapping of junction reads.
A junction read can span two or more exonic regions. Conceptually, the more exons a junction read spans, the harder it is to align it correctly without the help of a reference transcriptome. For those junction reads in Table 3, we further divided them into sub-groups according to the number of exons they span (see Table 4, Figure 2). On average, 55% two-exon junction reads can be aligned to the same genomic regions regardless of the usage of a gene model when mapping reads. For those junction reads spanning 3 exons, the percentage drops to less than 7%. And the percentage continues to drop to nearly zero for those junction reads spanning 4 or more exons. Obviously, the more exons a junction read spans, the less likely it can be mapped correctly without prior knowledge on junction sites in a gene model (Figure 2).
Figure 2. The relationship between the relative abundance for “Identical”, “Alternative” and “Unmapped” reads and the number of splicing sites.
Evidently, the more exons a junction read spans, the less likely it can be mapped correctly in “None/Unique” run.
The splicing patterns for “Identical”, “Alternative” and “Unmapped” reads
As we conclude above, a reference transcriptome mainly affects the mapping of junction reads. For a junction read spanning more than two exons, it is hard to align it correctly without a reference transcriptome. One interesting question is what kind of junction reads tend to be mapped identically, alternatively or unmapped. According to Table 4, about 98% junction reads span only 2 exons. In order to characterize the splicing pattern, we focus on only two-exon junction reads. For each junction read in “RefGene/Unique” run, we calculate the number of overlapping nucleotide bases with its left exon (OL) and right exons (OR), respectively. Then the minimum of OL and OR is chosen for histogram analysis (Figure 3). Since the full read length is 75 bp long, the MOE (Minimum Overlap with an Exon, MOE = min (OL, OR)) ranges from 1 to 37 for any junction read. For “Identical” junction reads, the typical MOE ranges from 15 to 37, and the frequency drops to nearly 0 when MOE is less than 10. For “Alternative” junction reads, the most dominant MOE is 1, representing an average of one third of cases. In general, those “Alternative” reads have very small MOE. For those junction reads with MOE of 1, 2 and 3, it is virtually impossible to map them ‘correctly’ without the prior knowledge on transcripts. The MOE for “Unmapped” reads has a much broader range with peaks from 4 to 12. In order to map a junction read without a reference transcriptome, the read should have sufficient overlaps with exons at both ends. The majority of “Identical” reads meet this requirement (left panels in Figure 3). However, if the overlap with one end is too short, let’s say 1 or 2 nucleotide bases, this read will be more likely mapped to only a single exon with the remaining couple of bases mapping to the intron region adjacent to the exon (middle panels in Figure 3). Otherwise, such junction reads become either unmapped or mapped to different genomic regions as non-junction reads if the overlap is something between (right panels in Figure 3).
Figure 3. The distribution of MOE (Minimum Overlap with an Exon) for junction reads.
The typical MOE for “Identical” junction reads ranges from 15 to 37. For “Alternative” junction reads, the most dominant MOE is 1, representing an average of one third of cases. In contrast, the MOE for “Unmapped” reads has a much broader range with peaks from 4 to 12. Note the scale for y-axis is not uniform.
Comparison of the mappings of “Alternative” reads
Note that all the examples and illustrations below are from sample Heart.s10, and only unique reads are shown in genome browser. Since “Alternative” reads are mapped to either different genomic regions or splicing sites, we are more interested in the mapping difference in detail and the main reasons for alternative mapping. As shown in Figure 4, those 19 unique junction reads are nearly perfectly mapped to gene HSP90AB1 in “RefGene/Unique” run. Without a reference transcriptome, four reads indicated by red arrow remain mapped to the same gene HSP90AB1 but with mismatches at one end of the read. A few bases previously mapped to another exon are now mapped to the intron region, and accordingly the junction reads become non-junction ones now. The remaining 15 junction reads are aligned to pseudogene gene HSP90AP3P as non-junction reads instead. The comparison reveals that the original mappings to HSP90AB1 for those 15 reads are nearly perfect, while they all have more mismatches when mapped to HSP90AP3P. Clearly, the alternative mapping for those junction reads in Figure 4 is getting worse without a reference transcriptome. In a sense, those 15 junction reads indicated by blue arrow in Figure 4A are “forced” to be mapped to a different genomic region without the help of reference transcriptome. Usually, such ‘forced’ mappings have mismatches, and quite often unusual coverage pattern is seen, as we will see later.
Figure 4. The impact of a reference transcripotome on the mapping of junction reads.
(A) In “RefGene/Unique” run, 19 unique junction reads are mapped to gene HSP90AB1 nearly perfectly. Four of them remain mapped to the same gene but differently and with mismatches without the usage of RefGene model. In fact, the four junction reads become non-junction ones with a few bases mapped to the intron region; (B) The rest 15 reads (indicated by the blue arrow) are alternatively aligned to gene HSP90AP3P in “None/Unique” run, but with worse mismatches and alignment scores compared to their mappings in gene HSP90AB1 (Figure 4A). Note the reads colored in blue are mapped to “+” strand, and colored in green when mapped to “–” strand. The mismatch is colored in red.
The impact of ‘Alternative’ junction reads on gene quantification is demonstrated and further illustrated in Figure 5. With the help of RefGene model, these 38 junction reads in Figure 5A are uniquely mapped to 8 splicing regions in gene PDIA3, and the alignments are nearly perfect. None of them is mapped to PDIA3P. However, all those 38 reads are uniquely mapped to retrotransposed pseudogene PDIA3P instead as non-junction reads without a gene model. Note in Figure 5B no other reads are mapped to PDIA3P. All alignments in Figure 5B have 1 or two mismatches (indicated by red dots in alignment profile). The island-like read coverage pattern in Figure 5B also indicates those mapped reads are false positives. If gene PDIA3P is truly highly expressed, ideally we should see mapped reads along exons evenly. Another sign for false mapping is mismatch, as indicated by those tiny red dots in alignment profile in Figure 5B. Each dot represents a single mutation or mismatch with respect to its nucleotide base in reference genome.
Figure 5. False mappings of “Alternative” junction reads.
(A) With RefGene model, these 38 reads are uniquely mapped to 8 splicing regions in gene PDIA3, and the alignments are nearly perfect. None of them is mapped to PDIA3P. (B) Without a gene model, however, all those 38 reads are uniquely mapped to retrotransposed pseudogene PDIA3P as non-junction reads. All alignments have 1 or two mismatches (indicated by red dots in alignment profile). (C) Without a gene model, all sequences mapped to these six genes are non-junction reads. They are re-mapped to elsewhere as junction reads with the usage of RefGene model. The read coverage profiles in Figure 5B and Figure 5C are neither flat nor continuous, and those sporadic coverage patterns are quite unusual. If these genes in Figure 5B and Figure 5C are truly expressed, we should usually see mapped reads across the entire exon regions, not just a few sporadic and isolated islands. Due to falsely mapped reads, we get misleading information on the expressions of these genes. Note gene CSDAP1 and its transcript is encoded in “–” strand, and colored in green. All the rest other genes colored in blue since they are encoded in “+” strand.
Those 38 sequence reads in Figure 5A and 5B are interesting. They are junction reads in the transcriptome when mapping them with the help of RefGene model (Figure 5A), but all become non-junction ones in the genome without a gene model (Figure 5B). The scenario illustrated in Figure 5A and 5B occurs when a gene has a retrotransposed pseudogene copy elsewhere in the genome. This is the case of PDIA3 (parent gene) and PDIA3P (retrotransposed pseudogene). Retrotransposed pseudogene is generated by reverse transcription of an mRNA transcript with subsequent reintegration of the cDNA into the genome, and the processed pseudogenes can also accumulate random disablements such mutations over the course of evolution. That’s why those mapped reads in Figure 5B have mismatches. The island pattern we observe in Figure 5B corresponds to the reads that should have been mapped to the junctions of the parent gene PDIA3, but have wrongly been mapped to the retrotransposed pseudogene PDIA3P. As a result, those falsely mapped reads in Figure 5B can give rise to misleading information on the expression of pseudogene PDIA3P. In the meantime, we also underestimate the expression of the parent gene PDIA3.
Similarly, for all those genes shown in Figure 5C, neither read remains mapped to them in “RefGene/Unique” run. Like Figure 5B, the sporadic island-like coverage patterns in Figure 5C do not support that those mapped reads truly originate from there either. As a matter of fact, ANXA2P2 is a processed pseudogene, and its parent gene is ANXA2. HSP90AB3P is a retrotransposed pseudogene as well, and it parent gene is HSP90AB1. Those sequences mapped to HSP90AB3P should have been mapped to HSP90AB1 as junction reads, as shown in Figure 4. Without the usage of a reference transcriptome, we get misleading expressions for those genes in Figure 5C as well due to ‘falsely’ mapped reads, and simultaneously underestimate the expression of those genes to which the reads should have been mapped.
“Alternative” junction reads are also likely to be mapped to the same start and end positions but splitting at different sites. Two cases in point are shown in Figure 6. For those junction reads mapped to gene TCEA3 with and without RefGene model, both mappings are equally well in terms of alignment scores and gaps between exons. So there is no way to tell which one is right without the assistance of reference transcriptome. Likewise, the mappings of junction reads in gene FBXL3 are also equally well regardless of the usage of RefGene model. Despite the minor difference in splicing sites, the read mapped with RefGene model is considered as fully compatible to a known gene, and thus is counted in gene quantification. However, without a gene model, the same read is mapped to exon-intron, and thus it is discarded at quantification step when only fully compatible reads are counted. For some counting packages, such as HTSeq , whether such reads partially overlapping with exons are counted is dependent upon the setting of the overlap resolution mode. When the mode is set to intersection-strict in HTSeq, the reads mapped to exon-intron are excluded. Thus, the alternative mappings in Figure 6 are not equal at gene quantification step in RNA-Seq data analysis.
Figure 6. Alternative splicing in “RefGene/Unique” and “None/Unique” runs.
All junction reads are still mapped to the same gene. The start/end positions and intron size are exactly the same regarding of gene model, but split differently. Without the assistance of reference transcriptome, we cannot tell which mapping is correct. Note the coverage profile corresponding to an exon and an intron are colored in red and dark blue, respectively.
Investigation on “Unmapped” reads
More than one third of junction reads fail to be aligned without the help of RefGene model (see Table 3). As shown in Figure 3, “Unmapped” junction reads tend to have a short 4–14 bp overlap with one of exons. Without the help of a reference transcriptome, it is hard to find the right splicing sites, especially for those reads spanning very small exons. Gene TNNT2 is a good example (Figure 7). There are 4 annotated transcripts derived from this gene by alternative splicing. Most exons are short, and even shorter than the read length. Note the exon 12 in the transcript NM_001001430 is as short as 6 bp long, and there is no way to map a junction read to such a small exon without prior knowledge (see bottom in Figure 7). There are a total 2,016 unique junction reads mapped to this gene in “RefGene/Unique” run. Without a gene model, 2000 of those junction reads cannot be mapped to this gene anymore. Accordingly, the expression level for this gene is significantly underestimated due to the mapping failure of the majority of junction reads. This also happens for other genes such as MYH6, MYH7, TPM1, KTN1 and RPL9, in which hundreds or thousands of junction reads become unmapped without the help of a reference transcriptome.
Figure 7. Junction reads across small exons can be mapped with the help of RefGene model, but not without the assistance of a reference transcriptome.
The region around exon 12 in the transcript NM_001001430 is zoomed out (the bottom panel). This exon is too short, only 6 bp long, and there is no way to map a junction read to so small an exon without prior knowledge on splicing sites. Note reads are colored in blue if mapped to “+” strand, and green if mapped to “–” strand.
As we can see from Figure 1, about 70% “Unmapped” reads are junction ones, and the rest ∼30% are non-junction reads. Why is the mapping of a non-junction read affected by a reference transcriptome? We are puzzled by this phenomenon at first sight. After careful investigation, we realized that the main reason is mapping mode. A read that is uniquely mapped in RefGene model is likely to become a multi-read when mapping across the genome, and accordingly excluded when only uniquely mapped read is reported. In Figure 8, there are 12 reads uniquely mapped to gene HSP90AB3P in “RefGene/Unique” run, unfortunately those 12 non-junction reads become unmapped in “None/Unique” run. When multiple-mapping is enabled and the threshold for reported mapping locations is set to 2000, those 12 reads can be mapped to multiple genomic regions, ranging from 57 to 1,075. At the bottom of Figure 8, the number to the right of each sequence is the number of genomic regions where this read can be mapped to. Figure 8 illustrates a scenario in which a unique read in a reference transcriptome can become a multi-read if mapped genome wide. Therefore the usage of reference transcriptome can reduce the mapping ambiguity for some RNA-Seq reads.
Figure 8. The mapping of non-junction reads in different mapping mode.
Twelve reads are uniquely mapped to HSP90AB2P with RefGene model in unique mapping mode (top panel), but all fail to be mapped in “None/Unique” run (middle panel). It turns out these reads can be mapped to multiple genomic regions in addition to gene HSP90AB2P, and thus discarded in “None/Unique” run. The total number of mapping positions across the genome for each read is shown to the right of each sequence (bottom panel). The 7th sequence can be mapped to as many as 1,075 positions.
Those “Unmapped” reads in Figure 7 and Figure 8 have dramatic impacts on accurate estimation of gene expression levels. In general, the expression is underestimated. We have quantified the effect of those “Unmapped” junction reads in Table 3 on gene expression, and it is found that there are a total 700 genes (see Table S1) in which the read counts are reduced by 20% or more due to the mapping failure of junction reads without the help of reference transcriptome.
Unique-mapping versus multiple-mapping
A read can be mapped to multiple locations not only across the reference genome, but even within a reference transcriptome. For instance, in the case of recent paralogs, reads obtained by the sequencing of one member of the gene family will usually map to several members. Gene AREG is an extreme example where the gene is present in 2 copies that did not diverge (Figure 9). These two identical transcripts are located at genomic regions Chr4: 75310853–75320726 and Chr4: 75480629–75490485, respectively. The transcript consists of 6 exons. At the bottom of Figure 9, introns are trimmed and the alignment profiles are shown. Note the alignment profile on the left is exactly identical to the alignment profile on the right. Each read in alignment profile is colored gray, meaning it is a multi-read. In multiple-mapping mode, the expression level for this gene is 20.23 RPKM, while in unique-mapping mode, the expression is ZERO. Obviously, unique-mapping mode is inappropriate for gene AREG.
Figure 9. The necessity of multiple-mapping mode for gene AREG.
Transcript NM_001657 can be generated by alternative splicing from two genomic regions. All reads mapped to one region are surely to be mapped to its twin region as well. As a result, these reads are excluded in unique-mapping mode. In order to properly map these multi-reads to gene AREG, multiple-mapping mode is necessary.
Another interesting example is shown in Figure 10 to demonstrate the impact of mapping mode on gene quantification. In unique-mapping mode, very few reads are mapped to the regions around exon #2 in gene UQCRH. While in multiple-mapping mode, the read coverage profile is more like a flat terrain. Intuitively, multiple-mapping mode works better than unique-mapping mode in term of the pattern of coverage profile. Those reads mapped to gene UQCRH in multiple-mapping mode but not in unique-mapping mode turn out to be junction reads around exon #2, as shown on the track between “Unique mapping” and “Multiple mapping” tracks in Figure 10. In addition to gene UQCRH, those junction reads can be mapped to gene UQCRHL equally well. Therefore, they are multi-reads in RefGene model. That is why they fail to be mapped to gene UQCRH in unique-mapping mode. Note in Figure 10 if we map RNA-Seq reads without a reference transcriptome and in unique-mapping mode, those junction reads are mapped to gene UQCRHL only (right in Figure 10). As a consequence, gene UQCRHL is reported to have a high expression level, and this is untrue. If gene UQCRHL is indeed highly expressed, we should be able to see reads mapped to other regions of this gene as well. The example is Figure 10 illustrates the importance of both a reference transcriptome and the multiple-mapping mode in accurate mapping RNA-Seq reads.
Figure 10. The impact of mapping mode on gene quantification: unique-mapping versus multiple-mapping mode.
Those junction reads on the track between “Unique mapping” and “Multiple mapping” tracks are mapped to the junction regions around exon #2 in gene UQCRH, as well to gene UQCRHL. As a result, these reads are excluded in unique-mapping mode. Consequently, we have a very shallow read coverage around exon #2 in gene UQCRH when aligning reads in unique-mapping mode. In contrast, in multiple-mapping mode, those reads are mapped, and accordingly, the read coverage profile for gene UQCRH significantly improves (left bottom).
According to the mapping summary reported in Table 1, about 5∼10% more reads are mapped in “RefGene/Multiple” run compared with “RefGene/Unique” run, If we increase the reporting threshold for multi-reads from 10 to a higher number, let’s say 100, more non-uniquely mapped reads are expected in Table 1. The impact of mapping mode on alignment is sample dependent (Table 1). As we know, a significant challenge in analyzing RNA expression of homologous genes is the large fraction of the reads mapped to multiple locations in the reference transcriptome and genome. In unique-mapping mode, all multi-reads are discarded, and this is too stringent and not ideal for those genes in Figure 9 and Figure 10. As previously noted , if the multi-reads are discarded, the expression levels of genes with homologous sequences will be artificially deflated. Our research has demonstrated the necessity of multiple-mapping, and thus, in conjunction with reference transcriptome, multiple-mapping mode is strongly recommended for RNA-Seq data analysis. When a read is mapped to multiple locations, report all locations instead of randomly pick one of the locations. The tools for downstream gene quantification can decide whether to include or how to assign multiple-mapping reads to those mapped genes or transcripts.
The impact of a reference transcriptome on differential analysis
Usually, RNA-Seq differential analysis requires replicates. However, we have a single sample in each different tissue. To demonstrate the impact of a reference transcriptome on differential analysis, we calculated the fold change between heart and liver samples with and without the usage of RefGene transcriptome in mapping. The correlation of the calculated Log2(Fold Change) was shown in Figure 11. Ideally, we should get a straight line if the reference transcriptome has no impact on differential analysis. Obviously, this is not true. Although the majority of genes had highly consistent or comparable expression changes, there were quite a few genes that are dramatically affected by the choice of using transcriptome reference or not. The points were colored in blue and red, respectively, if the corresponding absolute difference between the two Log2(Fold Changes) was greater than 2 or 3.3. Some genes were found to have an expression change greater than 32-folds (2∧5) in one analysis, but was found not to change in the other analysis. Clearly, using reference transcriptome in read mapping has an impact on the downstream differential expression analysis.
Figure 11. The correlation of the calculated Log2(Fold Change) between heart and liver samples with and without the usage of RefGene transcriptome in mapping.
The points were colored in blue and red, respectively, if the corresponding absolute difference between the two Log2(Fold Changes) was greater than 2 or 3.3. Although the majority of genes had highly consistent expression changes, there were quite a few genes that were dramatically affected by the choice of using transcriptome reference or not.
Short reads generated by RNA-Seq experiments must ultimately be aligned, or “mapped” to a reference genome or transcriptome assembly. The general objective when mapping a collection of sequencing reads to a reference is to discover the true location (origin) of each read with respect to that reference. Read alignment to a reference provides biological information in two basic ways. First, it generates a dictionary of the genomic features represented in each RNA-Seq sample. Second, the number of reads aligned to each feature approximates abundances of those features in the original sample. Such measures of digital gene expression are then subject to comparison among samples or treatments in a statistical framework. Despite several years of ongoing improvements, alignment of the junction RNA-Seq reads to a reference genome is not a solved problem yet, owing both to its intrinsic complexity and rapid advances in the sequencing technologies. Thus the accuracy of read mapping and gene quantification is critical for differential analysis.
In this paper, we have studied the impact of reference transcriptome on mapping RNA-Seq reads, especially on junction reads. As shown in Table 3 and Figure 1, only about 50–55% junction reads can be mapped to the same genomic regions regardless of the usage of RefGene model. More than one third of junction reads fail to be mapped without the help of a reference transcriptome. For “Alternative” mapped reads, their mappings in “None/Unique” run are usually worse than their corresponding alignments in “RefGene/Unique” run, as we demonstrated in Figure 4, 5 and 6. For those junction reads spanning more than two exons (Table 4 and Figure 2), it is less likely to correctly align them without the help of reference transcript.
All sequence reads in our dataset are 75 bp long. It is noted that the results in this paper will be different if the same tissue sample are sequenced at other different read lengths. The impact of the read length on the mapping of junction reads remains an interesting question for our exploration in the future. Nowadays, most providers, including BGI (the largest sequence service provider in the world), offer RNA-Seq sequencing service to customers by delivering 50–150 bp short reads. As the sequencing technology evolves, the read length is becoming longer and longer. For instance, the newest MiSeq desktop sequencer from Illumina and its MiSeq Reagent Kits v3 can generate reads of 300 bp long. Read length certainly has a remarkable effect on the detection of exon-exon junction and on the mapping of exon-exon spanning reads. When reads become longer, they are more likely to span multiple exons, and thus the mapping of long junction reads is actually becoming more and more challenging without the assistance of reference transcriptome. Thus, the need to have reference transcriptome included in the mapping is greatly increased. Therefore, the advantages of using reference transcriptome in the mapping demonstrated in this study are becoming more evident for longer reads. Indeed, long read lengths increase read uniqueness in mapping, but make alignment of junction reads more difficult without the prior knowledge on transcripts.
However, when the reference transcriptome used to guide the mapping of reads is incomplete or inaccurate, some biases or errors can be introduced as well. Pyrkosz et al  has explored the issue of “RNA-Seq mapping errors when using incomplete reference transcriptome” in detail. They used simulated reads generated from real transcriptomes to determine the accuracy of read mapping, and measured the error resulting from using an incomplete transcriptome. When 10% increments of the chicken reference transcriptome are missing, the true positive rate decreases by approximately 6–8%, while the false positive rate remains relatively constant until the reference is more than 50% incomplete. The number of false positives grows as the reference becomes increasingly incomplete. For model organisms such as human and mouse, their transcriptome models are relatively more complete compared to non-model organisms. The human RefGene transcriptome used in our analysis is a collection of non-redundant, curated mRNA models. It is a relatively stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. Admittedly, RefGene transcriptome is not 100% complete and accurate, but its quality is constantly improving. For transcriptome guided mapping of RNA-Seq reads, the more complete and accurate the transcriptome, the better.
Pyrkosz et al  also notice that the completeness of the reference transcriptome interacts significantly with the mapping mode. The true positive rate for multiple-mapping mode is dependent solely on the correct transcript being in the reference, while the unique-mapping mode has substantially lower true positive rates. This is consistent with our results. In conjunction with reference transcriptome, it is highly recommended to map RNA-Seq reads in multiple-mapping mode for both junction and non-junction reads, as we have demonstrated in Table 1, Figure 9 and Figure 10.
In addition, Seok et al 
Citation: van der Weide RH, Simonis M, Hermsen R, Toonen P, Cuppen E, de Ligt J (2016) The Genomic Scrapheap Challenge; Extracting Relevant Data from Unmapped Whole Genome Sequencing Reads, Including Strain Specific Genomic Segments, in Rats. PLoS ONE 11(8): e0160036. https://doi.org/10.1371/journal.pone.0160036
Editor: Peng Xu, Xiamen University, CHINA
Received: April 26, 2016; Accepted: July 12, 2016; Published: August 8, 2016
Copyright: © 2016 van der Weide et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The original WGS data used in our analysis are listed in Table 1 and are available under the following accession numbers in the “European Nucleotide Archive”: • ERA000170, http://www.ebi.ac.uk/ena/data/view/ERA000170; Atanur, S. et al. (2010) • ERP002160, http://www.ebi.ac.uk/ena/data/view/ERP002160; Atanur, S. et al. (2013) • SRA046343, http://www.ebi.ac.uk/ena/data/view/SRA046343, Guo, X. et al. (2013) • PRJEB6956, http://www.ebi.ac.uk/ena/data/view/PRJEB6956, Hermsen, R. et al. (2015) The derived data supporting the results of this article are available in the “European Nucleotide Archive” repository PRJEB12009, http://www.ebi.ac.uk/ena/data/view/PRJEB12009.
Funding: This work was financially supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. HEALTH-F4-2010-241504 (EURATRANS) to EC.
Competing interests: The authors have declared that no competing interests exist.
Next-generation sequencing (NGS) is used in a large variety of applications ranging from single cell analyses to complex microbial communities and complete vertebrate and plant genome analyses . NGS reads are, in general, aligned to an organism-specific reference genome as a first step in data analysis. Such reference genomes are typically derived from a single individual, animal or strain, with the exception of the human reference genome. Reads that align (map) to the reference genome are subsequently used for data analysis, while the unmapped reads are usually discarded [2,3]. Failure to map against the reference genome can be due to two mechanisms: 1) errors occurred in the sequencing process and as a consequence the read does not faithfully represent the original DNA fragment, or 2) the sequence captured in the read is not, or only partially, present in the reference assembly used for mapping. Filtering out reads originating from the first source is fairly straightforward and implemented in most data processing procedures by discarding reads with low quality scores [4,5]. The second source of unmapped reads often contains sequences from exogenous species due to experimental and sampling contamination [5–8]. Software has been developed that removes this class of sequencing reads [5,6,9,10], however most analysis pipelines do not include such a filtering step, resulting in lower mapping percentages or falsely mapped reads due to the greedy nature of NGS mapping algorithms . Interestingly, a fair portion of unmapped reads remain when sequencing-artifacts and contaminations are removed . With broad applicability of NGS methods, there is an increasing interest in the source of unmapped reads. Previous work has shown that there is relevant information to be discovered [8,13–16]. We set out to determine what type of sequences are present in this "scrapheap" of data obtained in the context of the genomic characterization of more than thirty widely used laboratory rat strains [Rattus norvegicus and Rattus rattus] [17,18].
High quality unmapped reads can contain biological information from loci that are missing from the species reference assembly including strain specific segments. Laboratory rat strains have diverged relatively recently but strain-specific genomic segments can be acquired through, for example, retroviral activity or lost in other strains through genetic drift [19,20]. While copy number variation and non-reference sequence variation have been shown to exist in mouse [Mus Musculus] strains and contribute to diversity [21,22], the sources and promises of unmapped reads have not been investigated in a systematic way. Several recent studies use similar approaches to extract biological relevant information from unmapped reads however these do not provide the underlying source code, making the methods hard to reproduce and implement [12,23].
The large amounts of inbred rat-(sub)strain complete genome sequences generated in the last decade provide a high quality dataset [24–27] which we used for development and implementation of a systematic un-mapped read analysis. While the rat reference genome has continuously been improved since its release in 2003 , to date, no strain specific segments have been included, in contrast to the most recent version of the human reference genome, GRCh38, which does include alternative loci for complex and highly variable regions .
We aligned whole genome sequencing (WGS) data of 33 rat strains to the latest rat reference genome assembly (BN/NHsdMcWi, RGSC5.0) to identify ‘unmappable’ reads (Table 1). Large differences were observed in the absolute amounts of unmapped reads (between 2 and 150 million) per (sub)strain. The highest amount of unmapped reads was found for SHR/Olalpcv, possibly due to the older, less accurate, sequencing technology used (Illumina GAII). Unmapped reads were subjected to a series of filtering steps (Fig 1). In the first step, we filtered out reads with low base-call quality scores (phred <25) and low read-length (< = 50% of the expected read-length) as well as reads that were unmapped due to known genetic variation (Fig 1A). On average 40.5±3.5% of the unmapped reads were removed by these criteria (Fig 2A). We observed that the fraction of remaining reads was independent of sequencing platform or genomic coverage. Interestingly, more than half of the reads pass Quality Control (QC) criteria and likely represent true biological sequences.
Fig 1. Filtering and processing workflow.
Alignment files went through three stages; A) Quality Control, to remove low quality reads and reads affected by genomic variation, B) Mapping-based filtering, to identify reads derived from contaminants and regions present in alternative reference sequences and C) Identification of possible biological function and classification of common / strain-specific status.
Fig 2. Origins of Unmapped reads.
A) Stacked bar-graph of the origins of unmapped reads per strain. Total number of unmapped reads (millions) per strain is displayed on the right vertical axis. B) Distribution of reads, mapping to the alternative reference genome (Celera) and/or to the Y-chromosomal BAC-contigs of SHR/Akr, for male and female samples. C) Contaminant contribution in the twelve samples with contaminant-derived reads. The used contaminant-database consists of RefSeq-genomes of bacteria, viruses and fungi.
Missing sequences in the reference genome
To avoid inclusion of reads that represent genuine reference genome information but could not be mapped due to reference genome gaps, we included raw sequencing reads from the original reference strain (BN/SsNHsdMCW) in our analysis and also utilized an alternative rat genome assembly that is based on two strains (BN/SsNHsdMCW & SD). We aligned the purified unmapped reads to an alternative rat genome assembly (generated by Celera) to identify unmapped reads due to an incomplete reference. The alternative genome assembly, referred to as ‘Celera’, is a hybrid consisting of 79% Brown Norway (BN/SsNHsdMCW) and 21% Sprague Dawley (SD) data . The majority of the high-quality reads (62.1±14.2%) could be mapped to this assembly. To identify reads that aligned to Celera due to missing sequences in RGSC5.0, we annotated the Celera assembly with regions covered by WGS-data of BN/SsNHsdMCW . The vast majority, 93.0±20.5%, of the unmapped reads mapped to these regions (Fig 2B), highlighting the incompleteness of the current RGSC5.0 assembly. The remaining reads that mapped to the Celera assembly (7%) could reflect strain-specific segments, shared with SD, which are lost/absent in BN/SsNHsdMCW.
The Y-chromosome is a source of unmapped reads
Both RGSC5.0 and Celera were based on DNA from female animals and do not contain Y-chromosomal contributions. To determine the amount of unmapped reads due to this omission, purified reads were mapped to sequences of the recently described Y-chromosomal BAC-contigs of SHR/Akr . Four strains showed significantly more reads mapping to these BAC’s: on average 6.7±1.6% of the total unmapped reads (P<0.01). Of those, two are known to be male (Da/BklArbNsi and F344/NHsd), while the other two are originally described as female samples (LE/Stm and SHR/Olalpcv). The latter could be a result from sequencing male animals, however the X-chromosomal coverage depth is similar to the autosomes, arguing against a male sample. Alternatively, these strains may have larger pseudo-Y-chromosomal segments present on the X-chromosome  (Fig 2B). Another 12% of the purified reads from these four strains map to both the Y-chromosomal BAC-contigs and to the alternative reference. Of these, 75.7±6.4% maps to sequences missing in the RGSC5.0 assembly, suggesting they could derive from homologous sequences in (pseudo) autosomal regions (Fig 2B).
Contamination is not a constant factor
A possible source of high quality unmapped reads is contamination. We aligned the remaining unmapped reads to a contamination-database consisting of prokaryotic, viral and fungal RefSeq-genomes (V.61). Other likely contaminants (e.g. parasites, human and mouse) are not included, as their high sequence-homology with rat could lead to removal of rat material due to greedy mapping . The identification of these possible animalia-derived contaminations is performed in the OrthoMCL-analysis later in this study.
Contamination-derived reads were found in 12 samples, with a contribution of more than 1% (median: 4.8%) in 6 experiments (Da/BklArbNsi, F344/NHsd, LE/Stm, LEW/NCrlBR, SHR/Olalpcv and SHRSP/Gla) (Fig 2A). Bacterial-derived reads were the largest contaminant-superkingdom with 88.4%, no bias was observed for a specific genus (Fig 2C). Apart from the positive control sample in Illumina machines -PhiX174- and herpesviridae, no laboratory-specific contaminants were identified.
A large fraction (~70%) of strain-specific contigs that show no homology to mouse consist of (proto)bacteria (Figure D in S1 File), suggesting a large contribution of feces-derived bacterial contamination. The three strains sequenced in East-Asia (F344/Stm, F344/NHsd and DA/BklArbNsi) had significantly higher amounts of predicted peptides with orthologs in the roundworm Brugia malayi (P<0.05). This roundworm is known to cause Brugian elephantiasis, a rare form of lymphedema, and lives in Eastern Asia . This biologically relevant finding shows that our workflow allows identification of (contamination) sequences from members of the Animala-kingdom.
Cross strain similarity of unmapped reads resembles phylogeny
Reads passing the previous filtering steps were most likely to contain strain specific sequences. To investigate the cross-strain similarity between the remaining reads, samples were clustered based on their between-strain sequence similarity. A large fraction (66%) of the reads had similar (>70% identity) sequences in more than one sample. The resulting similarity matrix shows a strong resemblance to previously established phylogenies of inbred rat strains [26,35,36] (Fig 3). An exception is the gender, which is reflected more strongly in the clustering than the phylogeny. This is likely due to the fact that the used Y-contigs are from a strain with a relatively short Y-chromosome. The used male samples can therefore still have a lot of overlap in Y-derived reads after filtering, simply because these have larger Y-chromosomes . The phylogeny is reflected in sub-strains with a shared ancestor-strain, for example the Lyon Hypertensive strains (LH/LN/LL). Interestingly, more distant relationships are also reflected in the similarity matrix, as shown by the clustering of strains derived from the Italian colony of the outbred Wistar rat (MHS/Gib and MNS/Gib). The observation that unmapped reads follow known evolutionary patterns, suggests that strain-specific/non-reference genomic segments can harbor biological interesting strain-specific/non-reference genomic segments.
Fig 3. Similarity-clustering of unmapped reads.
Black semi-filled phylogenic background denotes male samples. Euclidean distance-based clustering of strains on basis of the percentage between-strain read-pair sequence similarity, with a minimal sequence similarity of two non-overlapping blocks of 34bp.
De novo assembly
To assess the function and characteristics of these putative strain-specific segments, we assembled the reads used for the similarity clustering with a de novo assembly pipeline (SOAPdenovo-based, see methods), yielding on average 11Mb of assembled contigs per sample. The average weighted median of contig-sizes per sample (N50) was 910±121bp (Figure A in S1 File). Although repetitive DNA (like satellite DNA) can be a source of species diversity, these did not assemble to contigs, which is a known limitation of short read sequencing and de novo assembly methods . Overall, the de novo assembly resulted in 94,759 contigs larger than 1Kb including 112 larger than 10Kb, in 30 strains. Distance clustering of substrain-assemblies showed strong resemblance to the phylogeny of R. norvegicus (Fig 4).
There were 3 samples that yielded a larger amount of contigs, as well as larger contigs: the males (Da/BklArbNsi and F344/NHsd) and F344/Stm, which were sequenced using mate-pair sequencing, a different library preparation approach with a larger insert size. Several other samples resulted in low-quality assemblies, in particular SBN/Ygl and SHR/NHsd. These 2 samples yielded no contigs larger than 1Kb and had an average N50 of 129bp. These samples had a relatively low sequencing coverage, although other well performing strains had a similar low coverage (Figure B in S1 File). These two samples do show a strong shift in GC-content between the total set of reads and the reads used for de novo assembly (Figure B in S1 File). Strong GC-bias (difference >2%) has been shown to aggravate de novo assembly due to unequal coverage in the genome .
Identified sequences and peptides
To distinguish between sequences present in the majority of strains (presumably missing in the reference genome and/or SD) and those that are strain-specific, we aligned the de novo assembled contigs of all samples and denoted the contigs identified within a single strain to be strain-specific (Fig 1C). By performing a BLAT search for all contigs against the rat trace archives, [v105, ftp://ftp.ncbi.nlm.nih.gov/pub/TraceDB/], we found that 0.1% of the common contigs showed a reciprocal overlap of 66% or more with rat trace archives (0.0028% with ≥85% or more), indicating that only a very small portion overlaps or extends existing archive sequences while the majority are novel sequences and are unlikely to represent gaps in the current rat reference genome. For the contigs that were classified as strain-specific, this was 0.02% (0.0018% with ≥85% overlap), showing an even greater enrichment for novel sequences, as expected.
On average 3.36Mb of assembled sequence was classified as strain-specific (30,432 contigs >1Kb and 51 contigs >10Kb). The average size of common assemblies was 4.1Mb (85,741 contigs >1Kb and 104 contigs >10Kb) (Fig 5). Contigs of sufficient length, ≥ 500bp, were used for further analysis and have been made available through the European Nucleotide Archive under accession PRJEB12009.
Fig 5. Annotation of strain-specific and common contigs.
Contigs that overlap with RepeatMasker in gray, Contig alignment to mouse assembly GRCm38 blue = negative, green = positive, and contigs containing predicted coding sequences in yellow and orange for both protein coding and mouse aligned contigs.
Both common and strain-specific assemblies contained between 20 and 40% repeat sequences, with large contributions form LINE and SINE elements. These elements have been described to be an important source of intra-species divergence [35,36]. Comparison of the contigs to the mouse genome GRCm38, the closest species to rat with a high quality reference sequence, revealed a slightly larger percentage homology to mouse in strain-specific compared to common sequences (26.99Mb (3.9%) strain-specific vs. 61.53Mb (4.2%), common). Finding more mouse-homologous sequences within strains compared to across strains suggests strain-specific deletions of ancient sequences (i.e. from the most recent common ancestor of rat and mouse).
We used various ab initio prediction tools to identify putative coding sequences in the assembled contigs. Putative Open Reading Frames (ORFs) were found in 1,589 strain-specific contigs (~1.92Mb) and in 1,726 common contigs (~2.83Mb) (Fig 5). OrthoMCL identified 1,270 (79.9%) strain-specific and 1,290 (74.7%) common putative ORFs as orthologs of known proteins.
The orthologs of both the strain-specific and common contigs were predominately found in mouse, rat and other rodents: these contigs potentially include (pseudo) genes and gene duplications. We find that mouse-homologous contigs contain the least amount of repetitive elements (20%). Contigs without mouse-homology (non-mouse contigs) contain 40% repetitive elements, suggesting that these could have been introduced through viral retro-transposition or genomic instability in between repetitive regions. Paired-end information was used to identify the genomic location of the contigs, resulting in 70.3% of the contigs being linked to multiple repeat-regions in RGSC5.0. Indicating that there are small (<1Mb) sequences, of possible biological relevance, interspersed within repeat-regions.
To assess the function of the ORFs, we used the Gene Ontology terms of their closest orthologs and performed a gene set enrichment analysis using the Panther 9 algorithm . We found that the biological processes of oxidative phosphorylation and metabolic processes were significantly overrepresented in the data compared to the rat reference (P<0.05). These processes have been shown to have a high amount of redundancy and plasticity [41,42], which could explain their abundance in strain-specific and evolutionary dynamic regions. A doubling of the expected amount of proteins with the molecular function of RNA-directed DNA polymerase activity was also found (Figure C in S1 File). This, in combination with the significant increase in reverse transcriptases (P<0.01), is strongly indicative of an active role of retroviral elements in laboratory rat strain evolution .
Protein structure comparison: Larp1b and Klb
We analyzed the predicted secondary structure of 2 randomly chosen in silico-translated contigs, one strain-specific and one common, with a length close (±1SD) to the mean length. The strain-specific contig S14218 of F344/NHsd showed a 98% similarity with mRNA of rat La ribonucleoprotein domain family, member 1B (Larp1b). The first 9 exons of the Larp1b gene are found adjacent to each other in the contig, while exon 10 is missing. Secondary Structure Prediction (SSP) shows strong similarity in all predicted secondary structures, except for the C-terminal region (Fig 5). The last 228bp of the contig were found to be DNA of the interspersed repeat class. Finding a genomic contig with a high similarity to mRNA, in combination with the non-LTR retrotransposon evidence, strongly suggests a pseudogene (instead of duplication in this strain or deletion in other strains): an insertion of host Larp1b-cDNA into the genome of the sequenced F344/NHsd sample.
A common contig identified in five substrains (WAG/Rij, SHR/NCrlPrin, WKY/NCrl, SHR/OlalPcv and F344/Nhsd) harbors a spliced incomplete and/or altered klotho-beta (Klb) homolog based on the overlap of common contig-SSP and Klb from M. musculus. Liver transcriptome data of SHR/OlalPcv indicates that parts of this ORF are expressed (Figure E in S1 File). Previous work in mice showed the correlation between Klb-deficiency and chronic renal failure, ageing and altered plasma Ca2+ -levels . Interestingly, four of the sub-strains containing this contig have been phenotyped  and show significantly lower levels of plasma Ca2+ compared to available data of other strains similar to our dataset (P<0.05). Our finding of the altered Klb and the previous studies suggest that this alternative Klb may influence the cellular calcium homeostasis and other functions of the klotho-family (e.g. endocrine factor and co-receptor of Fgf23) in the identified sub-strains but this association clearly warrants further research.
In vitro validation
We performed a small-scale PCR-based experiment on 6 contigs to assess both the quality of assembly and the assignment of strain-specificity (Fig 6). All 6 contigs were confirmed in the strain it was identified in, showing that our assembly-methods were correct. The assignment of specificity was correct for the assessed strains, the 5 common contigs were found in more than one strain and the strain-specific contig was only found in its own strain.
Fig 6. In vitro validations; PCR-primers (in duplo) for 6 contigs in 3 strains.
Colored bars denote the strain in which the contig was identified; line color and type indicate contig classification; strain-specific (solid cyan) or common (dashed magenta).
In our quest to elucidate the origins of unmapped read-pairs of 33 rat-strains, we found sources related to both the wet- and dry-lab procedures. Most importantly, extensive strain-specific genomic segments were found, including regions with potential biological functionality.
The amount of low-quality and short-length reads is less than 50% of all unmapped reads for all strains. This is in line with previous work that showed that less than half of unmapped reads are sequencing artifacts . Since base-call quality score is based on a prediction of the probability of an error at a particular base  and such predictions are typically on the safe side, a small amount of low quality reads could have been filtered out incorrectly. Lower quality scores have also been correlated to difficult sequences, including reads with very high and low GC-content and simple sequences (e.g. mononucleotide repeats). Stringent filtering for quality would lead to lower sequence coverage, which may be particularly problematic in regions that have a high GC-content, such as known regions with specific regulatory functions [47,48].
By comparing the current rat reference genome with WGS data obtained from the same animal that was used for creating this reference, we found that 39% of the total unmapped reads are due to missing sequences in the reference genome. As such the use of a single-strain reference pipeline (with a separate strain used for the Y-chromosome) is questionable. We identified regions missing in the reference genome that are common in rat strains as well as regions that are strain-specific. Population specific loci are also found in other genomes: for example the 17q21.31 region in the human genome has a megabase-long population-specific inverted haplo-block and as such has an alternative locus (path) in the reference genome assembly [49,50]. Future work should be focused on creating population-specific alternative paths in reference genomes and a low-memory reference-guided assembly pipeline, which would be more resilient towards population-specific genomic inserts and/or deletions. Recent results and analysis show that the use of alternative paths can lead to more complete analyses and are less sensitive to missing sequences in the reference genome .
High amounts of bacterial and viral contaminant-derived reads were identified in several samples. While we focused on unmapped reads, more in-depth research of contaminants could investigate the amount of contaminant-derived mapped reads which lead to falsely mapped reads in the reference, similar to a recent study in humans . Another source of contamination is inherent to the sequencing methods used: adapters and positive controls. Of the contamination, a large part can be attributed to the positive control sample (phiX) of the Illumina platform. Sequencing adapters are found in this study to lead to high amounts of (unnecessary) unmapped reads, as these adapters lead to mismatches during mapping. A pre-mapping filtering script is generally not used, since such a step is computationally heavy and unmapped reads are not used in analyses. However, we show that this could lead to a potentially large amount of contaminant-derived reads.
The distribution of ‘unmapped reads’ across strains follows the (polymorphism-based) phylogeny, while this could be expected it was never quantified in an unbiased way. Previous studies have shown that there is similarity between contigs of unmapped reads in mice-strains and that there is a biotype-based clustering in aphid unmapped reads [12,15]. Our study shows that there is up to 90% similarity between unmapped reads of samples with a shared ancestor. Even though we filter against Y-chromosomal BACs a large portion of Y-chromosomal contigs remains in the data, as strains cluster firstly on gender, rather than phylogeny. The incomplete removal of this signal using BACs is likely due to the high variability of the Y-chromosome between strains: the strain (SHR/Akr), used for the Y-chromosomal BAC-contigs, has a small Y-chromosome compared to BN, which ranks amongst the largest .
When comparing the total size of strain-specific genomic segments per sample in our study with a previous study in mice , we find similar amounts of strain-specific genomic segments in inbred rat strains. The effect on sequencing depth and library size on de novo-results is also apparent: F344/Stm (mate-pair), F344/NHsd (35.4x) and DA/BklArbNsi (32.6x) have an approximately tree-fold higher amount of strain-specific bases in their assemblies. Interestingly, there appears to be a trend between higher amounts of total sequencing reads (e.g. DA/BklArbNsi, F344/NHsd and LE/Stm) with respect to the amount of repeats in their respective contigs. This could be due to a higher amount of retroviral activity, but is more likely to be the results of higher coverage resulting in better de novo assemblies.
The finding of putative protein-coding regions in the strain-specific contigs and the high percentage of interspersed elements is in line with the previous studies on (non-)LTR retro-transposons and the evolution of the genome in mammals [19,20]. Due to the limited size of the contigs, we do not always have sequence extending up to the poly-A tails and/or promotors, which prevents the distinction between functional (i.e. duplicated) genes or pseudogenes.
The identification of the (pseudo)gene of Larp1b is a good example of the potential impact of an assembly based approach for unmapped-reads. The contig contained a sequence highly similar to the mRNA of Larp1b with a bordering SINE-sequence, it has a near-complete ORF without introns (the predicted secondary protein structure is similar, Fig 7). The altered Klb in strains with lower amounts of cellular calcium homeostasis is an example of ability of our analysis to find a potential causal relationship based on matching strain distribution patterns of genotypes and phenotypes.
Larpb1 in F344/NHsd A) Alignment of F344/NHsd contig S14218 with rat Larp1b-mRNA shows strong similarity with the first nine (of ten) exons. B) Superimposed SSP of F344/NHsd contig S14218 on SSP of Larp1b mRNA. Only the N-terminal structure, in black, -denoted with the arrow- is missing from S14218.
This study shows that a large portion of unmapped reads in the rat model system are of biological interest, rather than sequencing-artifacts. This is largely due to the use of an incomplete reference genome, derived from a single substrain, which is a problem for all publically available genomes apart from human. The large amounts of biologically relevant data in the unmapped reads, as demonstrated by their potential to reconstruct the phylogentic tree, highlights the amount of sequences missing in the reference-strain, indicating that all re-sequencing experiments should account for the shortcomings of using a single reference genome. Using multi-path genome assemblies will be a partial solution to this but it fails to account for recently acquired material. The efforts described here can contribute to the identification and creation of alternative paths, while also identifying novel sequences in a systematic way. Until alternative paths are systematically included and part of analysis procedures, researchers should be aware of the missing sequences or use approaches such as described here to detect these. Rat based studies can use our resource for mapping and in-depth analysis of strain-specific genotype-phenotype relationships. Analysis of the strain-specific genomic segments resulted in the discovery of candidate coding segments and homologous sequences, which both have the potential for playing a role in the strain-phenotypes.
For this study, paired-end Illumina HiSeq WGS-data of 30 rat strains was used [17,24,26]. An additional Strain of Unknown Origin (SUO) was added to the dataset . SHR/OlalPcv paired-end WGS-data (Illumina GAII-platform) and the mate-pair data of F344/Stm was also included, which brought the total amount of strains to 33 [24,52] (Table 1). The raw data was mapped against the Brown Norwegian (BN/SsNHsdMCW) reference genome, version 5.0 (RGSC5.0) with BWA mem (0.7.5a-r405), base quality scores were recalibrated with the genome analysis toolkit 3.1–1  and PCR duplicates were removed with Picard tools MarkDuplicates 1.118 . Unmapped pairs were extracted with SAMtools 0.1.14 . Pairs with only one mapped mate were also extracted: these were mapped with the anchored split-read mapper Pindel 0.2.5a1 . Correctly mapped pairs from Pindel (originating from deletions, short insertions, inversions or tandem duplications) are considered to be unmapped due to genomic variation (Fig 1A). A full description of the performed analysis, utilized third-party software and utilized custom scripts are available in the “GitHub” repository, http://git.io/scrapheap.
The Illumina TruSeq-2/3 adapters were clipped from the unmapped pairs, including the unmapped pairs of the Pindel-mapping, using Trimmomatic 0.30 . Base-quality clipping was done by using a 25nt sliding window with a phred33-score quality threshold of 25. The pairs were then filtered on read-length, where both reads must be longer than fifty percent of the expected size. Pairs that did not meet these criteria were considered to be sequencing artifacts and/or errors.
The remaining read-pairs were compared with the alternative reference genome, Y-chromosomal BACs and a contamination databases with an in-house pipeline (available on request). This pipeline maps read-pairs to all three databases with BWAmem (0.7.5a-r405) and extracts pairs that do not map in-pair at the correct insert-size against any of the databases. Furthermore, it generates hit-counts of all three databases, including overlapping read-pairs (e.g. read-pairs that map to both Celera and Y) (Fig 1B).
The alternative reference for R. norvegicus from Celera Genomics, Rn_Celera, is composed of 29 million BN-fragments and 8 million SD-fragments . SOLiD WGS BN/SsNHsdMCW-reads were mapped against Rn_Celera and the mapped regions were selected . This gives the possibility to differentiate between read-pairs mapping to true SD-specific regions or to missing regions in RGSC5.0. This, because Celera-regions mapped by BN/SsNHsdMCW SOLiD-reads are considered to be regions that should also be in RGSC5.0, which is an BN/SsNHsdMCW-assembly.
Data from the Rattus norvegicus Chromosome Y Mapping Project, containing BAC-contigs from Solexa- and Sanger-sequenced SHR/Akr, is used for finding Y-chromosomal read-pairs . Reads from repeats and from cross-genome homologous regions could also map to these BAC-contigs: this would lead to high numbers of false positives (e.g. the identification of Y-chromosomal read-pairs from (fe)male samples). To investigate this, male and female paired-end WGS-data was aligned to the BAC-contigs with BWAmem, extracting all properly mapped pairs . In mice, 50–66% of the Y-chromosome synapses with the X-chromosome . If the BAC-contigs are complete and a correct representations of the rat Y-chromosome and these putative percentages of pseudoautosomal regions (PAR) in mice are similar in rat, similar percentages mapped read-pairs of the female compared to the male data should be found (Supplemental text 1). For the scope of this study, however, all mapped read-pairs to the (putative PAR-regions of) BAC-contigs were removed from the dataset: they were considered to not be strain-specific, as they map to (the Y-chromosome of) SHR/Akr.
The prokaryotic, viral and fungal RefSeq-genomes (V.61) were used for the contaminant-database. Contamination is defined here as non-animalia and can be the result of (in)direct contamination of a sample or from a positive control in the sequencer, like the bacteriophage PhiX174 of the Illumina HiSeq-platform. In order to keep the amount of false positives (i.e. reads mapping to genomes of closely-related species to rat) low, we did not include animalia-genomes in this step. If contamination of animalia was present in the dataset, we were able to identify this in the OtherMCL-analysis. To produce an overview of the contaminating bacteria, mapped regions in the bacterial genomes were pooled at the genera-level. This was found to be the deepest taxonomic layer possible for the resolution of 500bp, the insert size of the read-pairs, due to the high percentages of whole genome sequence identity within bacterial genera .
Comparison of relevant reads
Compareads 2.0.2  was used for the comparison of the remaining read-pairs of different Illumina paired-end sequenced strains to each other. GK/Ox, SBN/Ygl, WKY/NHsd and WAG/Rij were discarded for this analysis. This was done because the median read-length was too low (50nt), low amounts of read-pairs (<30k) and/or because of GC-bias. Since read-pair filtering was already done, only the direct comparison- and extraction-scripts of Compareads were used. For the comparison, two K-mers of 40% read-length were required to be considered similar.
De novo assembly pipeline
For the assembly of the remaining read-pairs, a SOAPdenovo-based pipeline was used on a 48-core Linux cluster with 500GB RAM. Pre-assembly optimization was done with SOAPEc v2.01 in HA-mode, which corrects sequencing-errors based on low-frequency Kmers : because the read-pairs were already filtered on base-quality, the phred-quality threshold was lowered to 30. Next, the optimal Kmer for de novo assembly was found using Kmergenie v1.5854 , which uses K-mer frequency-distribution estimation for finding the Kmer with the maximum amount of unique Kmers, leading to more accurate de novo assembly-inputs.
SOAPdenovo v2.04 was used as the primary assembler, due to the ability to handle large genomes and 100-bp Illumina paired-end sequences, while keeping the computational burden low. This assembler was also used in the de novo assembler-comparison Assemblathon 2 for a similar dataset . Because the found optimal Kmers were never higher than 59, the 63mer-version was used. Post-assembly optimization, filling gaps in scaffolds and error-correction of contigs derived from reads with incorrect insert-sizes in the contigs, was done with the GapCloser tool, which was specifically designed for downstream analysis of SOAPdenovo results. A distance-plot was made by aligning all strain-assemblies to each other with BLAT version 35x1  and calculating the distance score as described in .
Strain-specific and common contigs
The assembled sequences of each substrain were aligned to all other (sub)strain assemblies with BLAT version 35x1 . Substrains were considered to be from the same strain if they a) have the same strain-name (i.e. the name in front of the slash) and/or b) have a genetic distance of less than 0.02 in the paper of . A Python-script (available on http://git.io/scrapheap) filtered out contigs longer than 500bp, have a matching sequence of more than 100bp and an overall similarity of more than 80%. Common contigs were clustered with CD-HIT version 4.5.4  and clusters of more than 2 sequences were aligned with Clustal Omega version 1.2 .
Sequence prediction and validation
Analysis of repetitive elements in the contigs was performed with RepeatMasker 4.0.3, using the Repbase-derived R. norvegicus library version 20140131 [69,70]. Augustus 3.0.1  was used for ab initio gene prediction of the non-clustering contigs, using additional rat EST-data from UniGene as extrinsic information. The resulting peptides were then assigned to orthologous groups using OrthoMCL 5 . Using the hypothesis that the chance of finding de novo proteins is magnitudes lower than finding orthologous proteins, we only keep the predicted protein with an assigned orthologous group. For the analysis of the examples, we also used SWISS-MODEL for automated protein structure modeling . Strain-comparison statistics were calculated with the Welch T-test.
In vitro validation
To assess both the quality of assembly and the assignment of strain-specificity, we performed a small-scale experiment. Primer-sets were designed for contigs of three different sub-strains and PCR was performed on samples of each strain. Primer sets form common contigs should lead to amplification of the sequence in the strain it was identified in as well as in other strains. Strain-specific contigs should yield primers that only lead to amplification in samples belonging to that specific strain.
The authors would like to thank Sander Boymans, Victor Guryev and Wim Spee for their scientific discussions. Furthermore, we thank Tadao Serikawa for making the F334/Stm unpublished data available to us. This work was financially supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement No. HEALTH-F4-2010-241504 (EURATRANS) to EC.
- Conceptualization: EC MS RHW.
- Data curation: RHW.
- Formal analysis: RHW MS JdL.
- Funding acquisition: EC.
- Methodology: MS RHW JdL.
- Project administration: EC JdL.
- Resources: RHW RH.
- Software: RHW JdL.
- Supervision: EC MS JdL.
- Validation: RHW PT.
- Visualization: RHW JdL.
- Writing - original draft: RHW JdL.
- Writing - review & editing: RHW EC MS RH JdL.
- 1. Cullum R, Alder O, Hoodless P a. The next generation: using new sequencing technologies to analyse gene regulation. Respirology. 2011;16: 210–22. pmid:21077988
- 2. Bateman A, Quackenbush J. Bioinformatics for Next Generation Sequencing. Bioinformatics. 2009;25: 429. Available: http://bioinformatics.oxfordjournals.org/content/25/4/429.full.pdf. pmid:19202193
- 3. Nowrousian M. Next-generation sequencing techniques for eukaryotic microorganisms: sequencing-based solutions to biological problems. Eukaryot Cell. 2010;9: 1300–10. pmid:20601439
- 4. Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, et al. RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res. 2012;40: W622–W627. pmid:22684630
- 5. Schmieder R, Edwards R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. Rodriguez-Valera F, editor. PLoS One. Public Library of Science; 2011;6: e17288. pmid:21408061
- 6. Cibulskis K, McKenna A, Fennell T, Banks E, DePristo M, Getz G. ContEst: Estimating cross-contamination of human samples in next-generation sequencing data. Bioinformatics. 2011;27: 2601–2602. pmid:21803805
- 7. Lusk RW. Diverse and widespread contamination evident in the unmapped depths of high throughput sequencing data. 2014; Available: http://arxiv.org/abs/1401.7975.
- 8. Fujimoto A, Nakagawa H, Hosono N, Nakano K, Abe T, Boroevich KA, et al. Whole-genome sequencing and comprehensive variant analysis of a Japanese individual using massively parallel sequencing. Nat Genet. 2010;42: 931–6. pmid:20972442
- 9. Kostic AD, Ojesina AI, Pedamallu CS, Jung J, Verhaak RGW, Getz G, et al. PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat Biotechnol. 2011;29: 393–6. pmid:21552235
- 10. Detecting and Estimating Contamination of Human DNA Samples in Sequencing and Array-Based Genotype Data. Available: http://ac.els-cdn.com/S0002929712004788/1-s2.0-S0002929712004788-main.pdf?_tid=2a1141ca-0f41-11e4-9beb-00000aacb360&acdnat=1405773508_79efc7e526d314921cc1d56df8543b99.
- 11. Bao S, Jiang R, Kwan W, Wang B, Ma X, Song Y-Q. Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet. Nature Publishing Group; 2011;56: 406–14. pmid:21525877