Python for next-generation sequencing

python-logoA group of my scientific colleagues who are working on novel next-generation sequencing approaches, recently approached me about developing some computational simulations of the processes underlying these new methods with a view to demonstrating their feasibility, but also to optimize them for use with large genomes – specifically our own human genome. This article is not about these new sequencing approaches per se, which are still under development and about which I am not in any case, at liberty to talk – but rather it is more a story about how a challenging genome sequencing project demonstrated once again, what an awesome tool Python can be, and inspired me to pen another “Python-for-the-biologist” mini tutorial.

The human genome is about 3 gigabases (3.0 x 109 bases) in length, and for sequences of such a size, there can be significant technical challenges both in the laboratory, to accurately and reliably capture the sequence data from the relatively delicate, nanoscale biopolymers that comprise the genomes of organisms – and computationally, in the storage, handling and analysis of the very large datasets that are generated by these approaches.

Our first inclination when confronted with the computational processing of these large sequence datasets, was to trawl the web for one of the various, specialized, open-source bioinformatics software suites that have been extensively optimized for such tasks. Due to the nature of the new sequencing approach being developed however, there turned out not to be a software package capable of processing sequence data in exactly the way that we needed. My subsequent inclination once we realized this, was to turn to my favorite programming language and all-round, go-to tool for computation, Python. Some years ago, in a talk I gave at a biotechnology company, I described Python as the “Swiss Army knife of programming languages” – an appraisal I feel is even more true today than it was when I first gave it. Tempting though it might be to dismiss Python – an interpreted programming language – as being too slow for handling the processing of large genomic datasets, it turns out that there is a highly evolved and optimized part of the standard Python library that makes it ideally suited for such a task and believe it or not, it is also extremely fast!

Sequence alignment tools for biology generally implement one or more of a number of sophisticated alignment algorithms such as Smith-Waterman or Needleman-Wunsch. These algorithms are extremely generalized and capable of scoring and optimizing alignments even between sequences with relative insertions or deletions. In the cases that we were exploring for this new next-generation sequencing approach, such algorithms were overkill for our purposes since the kinds of patterns that we needed to match in our genomic searches were much simpler.

This seemed to me like a job for regular expressions – something for which Python offers the coder a very mature, evolved and highly optimized library*. If ever there was a corner of computer science that it would really behoove the biologist to learn for the sake of his or her own career, it might well be (in my humble opinion) regular expressions.

* The Python library links in this article are for Python 2.7 which is the version currently supported by most of the major scientific and numerical Python packages, as well as being the most commonly used release in the scientific Python community.

So what is a regular expression?

I will provide a few simple examples here, but a full discussion of the incredible power of regular expressions is the subject of whole books (like this for example) and would be way (way) beyond the scope of this simple article. I will however, endeavor to convey something of the flavor of regular expressions, along with some insights into how powerful they can be when applied to the analysis of biological sequences. (At this point I hear some yawning from the Perl guys at the back of the room – nothing new here for you guys I’m afraid, so this might be an opportune moment to grab another cup of coffee).

Consider the DNA sequence ‘aggatcgtaggcatgctgggcctatactggactc‘ (don’t bother searching it in BLAST – I just made it up for demonstration purposes) .

A regular expression can be as simple as a literal string like this ‘gga‘. We can search for this pattern in our sequence by importing the Python re module and using its ‘findall‘ function – like this:

import re
s = ‘aggatcgtaggcatgctgggcctatactggactc’
p = ‘gga’
print re.findall(p,s)
[‘gga’, ‘gga’]


 (the blue text is what I typed into the Python console, the green text is the resulting output)

Simple enough right – the search correctly found two matches in the sequence, but there’s a lot more we can do with regular expressions. Supposing we wanted to search for the pattern ‘ggx‘ where x is either of the purine bases a or g. We can get the result we want by rewriting our search pattern like this:

p = ‘gg[ag]’
print re.findall(p,s)
[‘gga’, ‘ggg’, ‘gga’]


Note that the square parentheses in regular expressions correspond to an ‘or‘ pattern that matches any of the contained characters, and now we see that our search has also revealed a ”ggg‘ pattern in the sequence. We could also extend the search even further by looking for the pattern ‘ggx‘ where x could be any of the 4 bases, like this:

p = ‘gg.’
print re.findall(p,s)
[‘gga’, ‘ggc’, ‘ggg’, ‘gga’]


The symbol . in regular expressions stand for any character, and now the more liberal search pattern has also identified a ‘ggc‘ stretch in our sequence.

Using the dot character to allow any base at a given position opens up many possibilities for more advanced search patterns that are of interest in genome sequencing. For example, it allows us to search for biologically interesting sequences in a genome such as gene promoters that often consist of relatively conserved recognition sequences separated by some number of bases. Many bacterial gene promoters for example, consist of two distinct recognition sequences upstream of the gene to be transcribed like this:


        |                        |         |
       -35                      -10        start of gene

The consensus sequence for the recognition site at the -10 position is ‘tataat‘ while the -35 site has a consensus sequence of ‘ttgaca‘. We can quite easily craft a regular expression that would search a genome for such consensus promoter regions like this:

p = ‘ttgaca……………….tataat’

This would match any sequences that consisted of exactly the two consensus recognition sites separated by a fixed number of bases. Since these sequences are the consensus sequences for motifs that are actually subject to some variability, we could extend the search to account for this variability by allowing mismatches at sites in the consensus sequences that we know to be more variable, like this:

p = ‘ttga.a……………….ta…t’

And finally, we could also create search patterns that allow a variable number of bases between the recognition sites like this:

p = ‘ttgaca.{15,25}tataat’

The curly brackets after the dot are used to indicate a repeat of between 15 and 25 occurrences of the character preceding the brackets – in this case the dot character that matches any base. This search pattern would match any stretch of the genome consisting of the two consensus recognition sites, separated by between 15 and 25 bases of any kind.

Looking at these very simple examples that barely scratch the surface of the what regular expressions are capable of, it is easy to imagine how it is possible to assemble extremely versatile search patterns capable of identifying biologically significant regions of genome sequences. I have shown some extremely simple examples of regular expressions here to give you a flavor of how useful they can be in a bioinformatics context, but how do they fare in real world problems with the kind of extremely large datasets that are typical in genomic sequencing research?

Can an interpreted programming language like Python process data fast enough to be useful in an environment where sequences of hundreds or thousands of millions of bases must be searched?

You might be pleasantly surprised.

To test the speed of Python regular expression searches, I created a random sequence of 250 million bases to represent something like human Chromosome 1, the longest contiguous piece of DNA in the human genome. I then ran a search using our simple -10 promoter consensus sequence ‘tataat‘, one million times so that I could get a good sample of the time taken for each search and derive an average (the fact that I am running the search one million times should already tell you that it is going to be fast).

Here are the results from one million regular expression searches of a genome of 250 million bases:

Number of matches found in sequence = 61226
1000000 searches completed in 0.671034 seconds
Mean search time = 6.71034e-07 seconds

* All quoted search times are for Python 2.7 on a MacBook Air with 8GB RAM, running OSX 10.8.4

Notice that the search recorded over 61,000 hits, so it was not fast simply because there were no matches for it to find and record.

OK, so let’s try a more complicated search pattern in the same large genome. Now we’ll try ‘‘ – the same small sequence, but this time, we are also allowing any base at the position of the second to last ‘a‘.

Number of matches found in sequence = 239310
1000000 searches completed in 0.650807 seconds
Mean search time = 6.50807e-07 seconds


Again, very fast. Notice also that the number of hits has now ballooned to almost 240,000 since we’re allowing any base at position 4 of the search pattern.

“Wait a moment!”, you might be saying to yourself, “Perhaps the average search time is very fast because the same search is being repeated over and over, allowing for some sort of caching of the results.”

No problem – let’s generate a million random search patterns of length 6, and search them all consecutively in a million independent searches. That way, we’ll be sure that there’s no caching going on.

1000000 searches completed in 0.652236 seconds
Mean search time = 6.52236e-07 seconds


Well there you have it. Python regular expressions are powerful and fast – fast enough indeed for searching large genomes if you don’t need the more elaborate scoring and alignment algorithms of a full-blown sequence alignment package.

In the next-generation sequencing project that I described at the beginning of this piece, we are working with much more elaborate Python regular expression patterns than the ones shown here for demonstration purposes, but the speed and efficiency of the searches is still extremely impressive for an interpreted programming language.

Like much of the Python Standard Library, the re module that handles regular expressions is actually written in C, so while the initial function call to the search may be handled by the Python interpreter, the subsequent search is actually being run in compiled native code, which explains its efficiency. Indeed, one of the great strengths of the Python language is that it interfaces vey easily with natively compiled languages like C, allowing computationally intensive parts of your code to be run down ‘on the bare metal’ with regard to CPU and memory, largely free of the overhead of the Python interpreter.

So to any biologist out there reading this, I hope that I have shown how Python and regular expressions can be your friends (if they aren’t already), and how a little time dedicated to getting to know them better, can be time well spent. 

 © The Digital Biologist | All Rights Reserved