Why filter for SNPs supported on both strands?

A quick learning point for those grappling with genome alignments, BAM/SAM files and SNP detection.

When aligning short-read paired-end data against a reference, you can often end up with spurious SNP calls near insertions or deletions in your sequence relative to that reference.

This is what it looks like (visualised via Savant Browser). In this case I was aligning some Illumina 35-bp paired-end data against a draft bacterial genome with Bowtie.

The SNPs will be called with high confidence and sit in reads that are apparently correctly paired. But crucially, see how the reads nearest the deletion (the blank space in the middle) point in only a single direction (as shown by the different shades of blue). Those reads are paired, but their mates are all located away from the deletion. There will also be read pairs that span the deletion, but they don't get aligned because they appear to exceed the maximum allowed insert size (250 bp is the default for Bowtie and can be changed with the --maxins parameter).
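To put rough numbers on why the spanning pairs go missing, here is a minimal sketch of the arithmetic. It assumes the sample carries a deletion relative to the reference, so a pair that straddles it looks longer on the reference by the length of the deletion. The 200 bp fragment and 300 bp deletion are made-up illustrative values, and the function names are my own.

```python
# Default maximum insert size for Bowtie, as quoted above.
BOWTIE_DEFAULT_MAXINS = 250


def apparent_insert_size(fragment_length, deletion_length):
    """Insert size as seen on the reference for a pair spanning a deletion.

    The sequenced fragment lacks the deleted bases, so when its two ends
    are placed on the reference they sit further apart by deletion_length.
    """
    return fragment_length + deletion_length


def aligns_as_pair(fragment_length, deletion_length, maxins=BOWTIE_DEFAULT_MAXINS):
    """True if the pair would still fit within the maximum insert size."""
    return apparent_insert_size(fragment_length, deletion_length) <= maxins


# Hypothetical 200 bp fragment away from any deletion: fits.
print(aligns_as_pair(200, 0))
# Same fragment spanning a hypothetical 300 bp deletion: 500 bp, too long.
print(aligns_as_pair(200, 300))
```

Raising --maxins would let some of these pairs align, at the cost of accepting more distant (and possibly wrong) pairings elsewhere.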

As these SNPs are very likely to be erroneous, you can filter them in a few ways. I use VarScan and like to insist that true SNPs are supported by reads on both strands. Also note that depth of coverage is typically lower in these regions because the alignment tails off.
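The idea behind a both-strands filter can be sketched in a few lines. This is not VarScan's implementation, just an illustration that assumes you have already pulled out, for each SNP, the number of variant-supporting reads on each strand. The SnpCall class, the thresholds and the example counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    position: int
    alt_forward: int  # variant-supporting reads on the forward strand
    alt_reverse: int  # variant-supporting reads on the reverse strand


def passes_strand_filter(call, min_per_strand=1, max_strand_fraction=0.9):
    """Keep a SNP only if the variant is seen on both strands.

    min_per_strand and max_strand_fraction are illustrative thresholds,
    not values taken from any particular tool.
    """
    total = call.alt_forward + call.alt_reverse
    if total == 0:
        return False
    if min(call.alt_forward, call.alt_reverse) < min_per_strand:
        return False
    # Also reject calls where almost all support comes from one strand.
    return max(call.alt_forward, call.alt_reverse) / total <= max_strand_fraction


# Made-up calls: one balanced SNP, two supported on a single strand only,
# as you would expect on either side of a deletion.
calls = [SnpCall(1200, 9, 11), SnpCall(5431, 14, 0), SnpCall(5436, 0, 12)]
kept = [c.position for c in calls if passes_strand_filter(c)]
print(kept)  # [1200]
```

The two single-strand calls are dropped however many reads support them, which is the point: high confidence alone does not rescue a SNP next to a deletion.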

To make this even clearer, if you align the same data but don't tell Bowtie the reads are paired (i.e. they are treated as fragment reads), you get the following result, shown alongside the paired data for contrast.

Now you can see that reads are aligned in both orientations right up to the deletion, meaning that filtering on both strands is no use if you are dealing with fragment data. But you can still filter on read depth, or perhaps you could filter on proximity to the end of an alignment.
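Both of those alternatives can be sketched roughly as follows. I am taking "proximity to the end of an alignment" to mean how close the SNP sits to the ends of the reads covering it, which is only one possible reading. Reads are given as hypothetical (start, end) reference coordinates, inclusive, and every threshold and number here is invented for illustration.

```python
from statistics import median


def passes_depth_filter(depth_at_snp, depths, min_fraction=0.5):
    """Reject SNPs whose depth falls well below the typical depth.

    depths is a sample of per-position depths; min_fraction is an
    illustrative cut-off relative to their median.
    """
    return depth_at_snp >= min_fraction * median(depths)


def fraction_near_read_ends(snp_pos, reads, edge=5):
    """Fraction of covering reads in which the SNP lies within `edge`
    bases of either end of the read."""
    covering = [(s, e) for s, e in reads if s <= snp_pos <= e]
    if not covering:
        return 0.0
    near = sum(1 for s, e in covering if min(snp_pos - s, e - snp_pos) < edge)
    return near / len(covering)


# Made-up depths: about 30x in general, dropping to 8x beside the deletion.
depths = [30, 32, 28, 31, 29, 8]
print(passes_depth_filter(8, depths))   # False
print(passes_depth_filter(30, depths))  # True

# Four hypothetical 35 bp reads covering position 100; three of them
# have the SNP within 5 bases of a read end.
reads = [(66, 100), (68, 102), (70, 104), (90, 124)]
print(fraction_near_read_ends(100, reads))  # 0.75
```

A SNP where most of the covering reads carry it near their ends would be a candidate for removal, though the cut-off to use is something you would need to tune on your own data.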

Of course, detecting patterns like these is the job of breakpoint detectors for structural variation discovery (such as BreakDancer), but in this case I am talking about the pitfalls of SNP calling specifically.

You may have discovered other techniques for filtering - please post in the comments if so!