In the last few years, the Heliconius community has generated a large quantity of next generation sequencing data, mostly from Illumina sequencers. One of the first steps in the analysis of this data is the alignment of short reads to reference sequences, whether BAC sequences, transcriptomes or whole genomes. There are over 70 read alignment tools available (Fonseca 2012), which has been thought to be too many. Which one(s) should we use for Heliconius butterfly data?
Several popular, fast aligners such as BWA and Bowtie are known to work well on human genomes. But Heliconius genomes, radiating over around 10 million years, are considerably more divergent than human genomes. In July 2011, I ran some aligner tests to see what the impact of divergence was on aligner performance. In those tests, Stampy performed considerably better than other aligners. Since then, we have mostly used Stampy for read alignment, notably for (I think) all of the alignments in the Heliconius genome + introgression paper.
Stampy has a serious drawback, which is that it is extremely slow compared to almost all other aligners – four or five times slower than Bowtie. This has been very inconvenient in the past and is only getting worse as data set size increases. I am about to analyse 1.8 billion reads from 70 whole genomes. Aligning this amount of data with Stampy will take many weeks even over many tens of compute cores. There have been several new aligner tools released in the time since I did the last test, so I thought it was time for another comparison.
Be warned, this is a very basic comparison done over a few days, with many flaws. If it was good enough to publish properly I would do so. It really only gets the ball rolling again, but I don’t really have the data to come to any firm conclusions, just to outline what the problems are.
Three whole genome sequences were used, one from a Heliconius melpomene melpomene x Heliconius melpomene rosina cross, one from Heliconius ethilla (09-49) and one from Heliconius erato erato (NCS_2556). These three genomes should be increasingly divergent from the Heliconius melpomene melopmene reference genome (see tree at Tree of Life). Each sample has 4m paired end 100bp HiSeq reads or just over. This is only about 3 times coverage of the genome, so we don’t expect to see every base covered or very accurate SNPs called. But using such limited data means the aligners can be tested relatively quickly. (I leave for another day the serious issue of what we might expect from an alignment of H. erato and H. melpomene, and how much sense it makes to align erato or silvaniform clade genomes to the H. melpomene reference, given their somewhat large divergence.)
1. Align with mostly default options.
2. Convert to sorted BAM with readgroups using Picard AddOrReplaceReadGroups.
3. Run samtools flagstat to get basic mapping statistics.
4. Run GATK UnifiedGenotyper to call reference and SNP bases. Use -out_mode EMIT_ALL_CONFIDENT_SITES, -stand_call_conf=1 and -stand_emit_conf=1; the samples used are very low coverage so good SNPs are likely to have qualities lower than the default threshold of 30. 1 is too low, but it may reveal weaknesses in the aligners.
5. Hack coverage and SNPs at various different qualities using one line Perl scripts.
This analysis is OK as far as it goes, but there are several problems with it.
1. It should be possible to improve aligner performance by tuning parameters. I did try a handful of alternative running modes (Bowtie 2′s end-to-end and local, for example) but have not searched numerical parameter spaces at all (see comment below on BWA).
2. Reporting what GATK makes of an alignment does not necessarily reflect the quality of the alignment (see Mihaela Pertea and Steven Salzberg’s takedown of a similar comparison).
3. Mapping quality scores and things like ‘properly paired’ counts are assigned by the aligners themselves and may not reflect actual quality; it would be possible to write an aligner that dumped reads randomly but assigned maximum mapping qualities, which would come out quite well on many of the following metrics (although it might be expected that GATK would call a large quantity of spurious SNPs, of which more later).
4. Performance times are unlikely to be accurate. All jobs were run on a 48-core server (4x 12-core 2.2GHz AMD Opteron 6174s) with 514 Mb RAM. Alignment jobs were run in parallel on 36 cores (for all jobs except bwa sampe, which can only be run on a single core). The server is quiet, but it is quite possible that other jobs or IO bottlenecks slowed down the alignment jobs, which were only run once. Or input data may have been loaded into RAM for later alignments. Still, I expect the orders of magnitude to be correct.
5. There are many other metrics one can think of to assess aligners, many of which have been used in other aligner comparisons, but I don’t like simulated data, and we do not have a well-validated set of real reads with known positions to test against at present.
With over 70 aligners available, which to choose? I have, of course, been far from comprehensive. I really just wanted to try Bowtie 2, but I did attempt a mildly systematic search for other potentially useful aligners. I used the HTS Mappers page and limited myself to DNA aligners that process Illumina data, are quality aware, handle paired ends and run multithreaded. Note that I probably ignored several good mappers by doing this; for example, I would have excluded Stampy because it is listed as not being multithreaded, as this feature has only recently been added. I then excluded several aligners because previously published data showed they were likely to be inferior to other aligners (for example, the SeqAlto paper shows SNAP to be very fast but perform poorly against most other aligners, particularly on divergent data, so it was unlikely to be useful for Heliconius data). I also ignored a few because they appeared to be abandoned or were commercial (CLC, Novoalign – Novoalign tests in July 2011 showed it to perform worse than Stampy for our purposes, as also shown in the Stampy paper). If I have missed your favourite aligner and you think it is worth testing on divergent data, please do let me know.
BWA is used throughout as a kind of control, as it is what we used to use and what is still widely used by many others (and recommended by GATK for example). It does not perform very well in any test, but perhaps it could be tuned to do much better, as it uses a seed of 32 bases which is very large for divergent data. But as it is often tempting to simply run using defaults, it is worth seeing just how bad the performance can be with divergent genomes.
H. m. melpomene x H. m. rosina read mappings
Most of the aligners have been tested on the H. m. melpomene x H. m. rosina cross individual, as that’s the dataset I’m most interested in now. I took only the best performing aligners on to the H. ethilla and H. erato comparisons, assuming that aligners that perform badly on nearby genomes will do worse on divergent genomes.
Samtools Flagstat for melpomene x rosina
36 core running time
Mapped reads (%)
Properly paired reads (%)
|BWA||3:14 * 2 + 22:31||28:59||79.42||75.02|
|Bowtie2, endtoend, very-sensitive||10:52||10:52||83.46||74.02|
|Bowtie2, local, very-sensitive||08:01||08:01||94.15||88.45|
|BWA+Stampy||28:59 + 26:22||55:21||94.24||84.41|
|Bowtie2-endtoend+Stampy||07:04 + 23:13||30:17||94.28||86.49|
|Bowtie2-local+Stampy||07:15 + 27:38||34:53||94.30||86.06|
|Smalt+Stampy||12:28 + 36:08||48:36||94.24||84.90|
Timing notes: the BWA timings show times for the aln step (one for each read) plus the sampe step. Early versions of Stampy could use BWA in a hybrid mode, rapidly mapping easy reads with BWA and then mapping the remaining unmapped reads with Stampy. Recent versions can now take any BAM file as input for this hybrid mapping, so any aligner can be used in this way. The X + Stampy rows show this hybrid mapping, with the time for aligner X given first and the subsequent Stampy run second.
Looking at numbers of reads mapped is not particularly informative, because exact mappers like BWA and Bowtie 2 with the default end-to-end mapping setting (where the whole read has to map) will inevitably do worse than the partial mappers like Stampy and Bowtie 2 with the local mapping setting (where only part of a read has to map). But it does indicate that Bowtie 2 with local and very-sensitive settings and SMALT are both competitive with Stampy, whereas SeqAlto is not, on this data at least. It also shows that small improvements can be achieved by running Stampy in hybrid mode with any other aligner, but the performance improvement is not that great. Sadly, running Stampy in fast mode achieved nothing like the 50% speed up the manual claims, with slightly degraded results. Both Stampy runs lag way, way behind Bowtie 2, SMALT and SeqAlto; it will make life much easier if we can convince ourselves that one of these aligners is more accurate.
H. ethilla and H. erato read mappings
Only BWA (as a control), Bowtie 2, SMALT and Stampy were tested on the more divergent genomes. Bowtie 2 was run with the local and verysensitive options as these options did not degrade performance very much in the H. melpomene tests.
36 core running time
Mapped reads %
Properly paired %
|BWA||3:37+3:35+19:56 = 27:08||52.00||46.26|
|Bowtie2, local, verysensitive||08:30||89.59||78.82|
36 core running time
Mapped reads (%)
Properly paired reads (%)
|Bowtie2, local, very sensitive||06:29||61.32||34.97|
Clearly BWA fails horribly on H. erato and does badly on H. ethilla. Bowtie 2, SMALT and Stampy all appear to be competitive, and cannot be separated on these metrics. While SMALT maps more reads, Bowtie 2 has more properly paired. As expected, Stampy’s performance is poor; Bowtie 2 is fastest, but SMALT is acceptable.
How many bases mapped?
As mentioned above, counting number of reads mapped is not very informative when partial mapping is allowed. It is much better to count number of bases mapped. The following charts show the total numbers of bases mapped for H. melpomene, H. ethilla and H. erato, broken down by mapping quality as recorded by GATK in the MQ field. Each bin value is a lower bound, so ’30′ means all bases with mapping qualities from 30 to 39.
This chart shows that, with data from similar genomes, all aligners do reasonably well. Actually, quite a lot of bases are not covered across the genome (around 200 million out of 273 million), but that’s presumably because only 4 million reads have been aligned. SMALT actually performs slightly better than Stampy (an extra half a megabase aligned), but any of the aligners would do a good job. Using Stampy in hybrid mode does not beat the SMALT result, except when paired with SMALT, when an additional megabase can be aligned, at the cost of quadrupling the running time.
Also, intriguingly, the chart shows how varied mapping qualities can be. BWA and SMALT don’t call anything over 60, Bowtie 2 doesn’t call anything over 40, but Stampy hands out a large amount of qualities in the 90s. Clearly there is some degree of bullshitting here, if we interpret mapping qualities as true phred scores – a mapping quality of 99 would mean there is a 1 in 10 billion chance of it being wrong. However, I don’t necessarily see a problem with this as long as thresholds are chosen appropriately. For example, in the Heliconius genome paper, we used a threshold of 67 for mapping qualities. This would have excluded all base calls from BWA, but it worked well for Stampy.
This effect also means that it is possible comparisons such as Heng Li’s ROCs are not entirely appropriate; for example, perhaps these curves retain a lot of Stampy bases at Qs of 20 or 30, which Stampy considers to be poor quality bases that should be rejected, and if a higher threshold was used it would produce a much better curve. But then, perhaps it shouldn’t output such high mapping qualities.
For H. ethilla and H. erato, there is a reduction in number of bases aligned as diversity increases, as expected, but Stampy maps several million more bases than any other aligner. Both Bowtie 2 and Smalt do reasonably well, however, and, depending on the analysis, it may be preferable to trade coverage for performance.
How many SNPs mapped?
The variations in a genome are usually our useful data points, rather than the number of bases. The following charts show the number of SNPs called by GATK and their mapping qualities.
Interpreting this chart is not straightforward because we have no way of deciding whether the SNPs are real or not, so it is not clear what is better. The Stampy alignment produces the most SNPs, but if it places a lot of reads incorrectly with highly exaggerated SNP calls, then GATK will call a lot of high quality SNPs from that alignment.
It may help to compare the number of bases mapped to the number of SNPs called. We might expect that the number of SNPs called should shrink or grow proportionally to the number of bases called. However, although Stampy aligns half a million more bases than SMALT, it also produces half a million more SNPs than SMALT, which is considerably more than we might expect proportionally. This makes me suspect that SMALT might be a sensible, conservative alternative here, rather than simply going for the aligner with the most SNPs. But then, maybe BWA would be an even more sensible option, as it maps most of the bases but apparently creates many fewer SNPs, with lower mapping qualities overall.
The following slide is taken from a talk on SMALT by the authors, and indicates that Stampy may indeed be overexuberant when aligning:
This indicates that the extra few percent of mapping that Stampy does is actually mostly erroneous, at least for nearby genomes. So to reduce false positives, SMALT appears to be a better option. But this is a test using simulated data by the authors of the tool, so may not be true in the wild (and may be falling foul of the mapping quality exaggeration issue discussed above – perhaps we could easily exclude the erroneous mappings by just raising our mapping quality threshold).
For H. ethilla and H. erato, the number of SNPs called is more in line with the number of bases aligned, and Stampy has a considerable edge over the other aligners, as it did for number of bases mapped. (As we would hope, H. ethilla has more SNPs than H. melpomene. However, H. erato actually has less, presumably because the most divergent regions simply don’t align.)
At present, the data above don’t give us enough information to make a choice once and for all. I think I would still prefer to use Stampy over other aligners if I was working with silvaniform clade or erato clade genomes and I had to align them to the H. melpomene genome. But for aligning melpomene clade genomes, I will probably use SMALT for now, and am using SMALT for the current whole genome alignments. It provides the best tradeoff between performance and coverage (and apparently accuracy, but it is too early to confirm that). It means I can get this set of alignments done in a week, rather than two to three weeks with Stampy.
There is obviously a lot more that could be done here, but I wonder if it is worth doing. Obviously if erroneous data is getting into our analyses, that’s a problem, especially if tools are labelling erroneous data with inflated quality scores. But I think I would prefer to figure out better biological filters to apply to the data after alignment than taking several weeks of work to assess aligners in more detail.
However, I would be keen to at least pool opinions on this – should we validate our aligners in more detail? Have we seen any evidence of systematic errors getting into final data sets because of aligner problems? Are there other simple tests we could or should do that would settle the issue?