Code Tutorial: Getting started with Python in the lab

python-logoMy previous article was inspired by my impression that many life science researchers are missing out on some of the wonderful computational tools that are out there, and that could be invaluable in helping them to manage and analyze their laboratory data. This may be due to a lack of awareness that these tools exist and that many of them are inexpensive or free, or it may be out of fear that learning to use them will take too much time out of an already hectic schedule. In starting this series of code tutorials for life science computing, I hope to point life scientists with relatively little exposure to programming languages, in the right direction to be able to start using these tools to write useful code that can solve real problems.

In this tutorial we will create a simple program that can predict the pattern of single-strand DNA fragments that you would expect on your electrophoresis gel after doing a restriction digest. So without further ado, let’s introduce one of our favorite programming languages Python. Python  is an incredibly versatile programming environment, being well suited for creating solutions from small scripts of a few lines in length, to large scale applications that are used throughout global organizations. The Python development environment is freely available for download from the official Python web site. For users of  Windows and Mac OS X platforms, Python comes in nice self-installing packages that pretty much do everything for you. For Linux users doing a manual install, you will need to unzip the install package and run a shell script or (more conveniently), you could just do a one-click install with one of the excellent software package managers that come with most Linux OS distributions. The good news for Linux users is that chances are, your Linux OS already comes with Python included and if you’re running your own Linux system, you probably don’t need our help installing stuff anyway!

We’ll be writing our code using the nice clean IDLE development environment that comes with Python. IDLE is really a kind of specialized text editor that can actually run the Python code you type into it. On Mac OS X or Linux, it can be launched from a terminal window (usually by just typing “idle” if everything is installed and configured correctly). On Windows you will probably need to launch it from the command line (“run”) window.

Is IDLE running now? Congratulations, you just ran your first Python application because IDLE itself is actually written in Python!

Just Do It

So let’s start by creating a (very) simple DNA sequence in Python. In IDLE, type the following line and hit the return key

mysequence = “atcg”

In programming jargon, ‘mysequence’ is a string of 4 characters. IDLE will probably color code the “atcg” part to show this. Now type the following and hit return

len(mysequence)

If everything went well, IDLE should now display the number 4 which is the length of the string in characters returned by the ‘len’ function. Now try typing the following lines. Make sure you indent the ‘print ch’ line by one tab space just as I have done and you will have to hit the return key twice after the ‘print ch’ line to actually run the code.

for ch in mysequence:
    print ch

You should now see your sequence printed out in order, one character per line like this

a
t
c
g

The code you just typed translates into english as “for each item in ‘mysequence’, give the item the temporary name ‘ch’ then print ‘ch'”. This kind of loop construction is known as an iterator in Python. Python knows that strings are made up of smaller items, in this case characters, so when you iterate over a string using the ‘for’ command, the iterator returns each character in the string, in order of occurrence. I used the name ‘ch’ because the items in this case are characters, but I could just have easily used any other name like ‘item’ or ‘c’ or ‘roger_the_shrubber‘.

The indent in the ‘print ch’ line is significant. Indents are the way to create blocks of code in Python that are to be executed all together. Every line of code that is indented by one level underneath the ‘for’ statement, gets executed on each round trip through the loop, so if we had for example, also wanted to print the position of the nucleotide in the sequence, we could have added one or two more indented lines under the ‘for’ statement like this

i = 0
for ch in mysequence:
    print ch
    print i
    i += 1

In this modified version, a variable i is declared before the loop and then also printed and incremented by 1 on each pass through the loop. The ‘i += 1’ syntax is Python shorthand for ‘i = i + 1’. The important thing to note here is that all 3 of the indented lines are executed for each pass through the loop.The output from the loop now looks like this

a
0
t
1
c
2
g
3

If we were to unindent  the ‘i += 1’ line, this would remove it from the loop and it would become the first line to be executed after all the passes through the loop had been completed. Can you figure out how this would change the output printed by the loop?

Let’s bring in the string section

Another feature of strings (and all list-based objects in Python) is that their characters/items can also be accessed according to their numerical position in the object, starting from 0 (all lists, sequences, arrays and so on in Python are numbered from 0), up to 1 minus the length of the object. So mysequence[0] = ‘a’, mysequence[1] = ‘t’ and so on. You can also slice strings up using [from:to] syntax like this

name = “Arthur King”
print len(name)
  11
print name[0:6]
 Arthur
print name[:6]
 Arthur
print name[7:11]
 King
print name[7:]
 King
print name[-4:]
 King

Note the following features of Python indices:  the last number in the ‘from:to’ range is exclusive; if the ‘from’ or ‘to’ index is excluded, the beginning or end of the string/list is assumed; a negative index for a string/list denotes the position -n from the end of the string.

Dictionaries are your friend

So now we know how to create DNA sequences as strings and how to iterate over them, let’s introduce the Python dictionary. Dictionaries resemble in some aspects the kind of list object that we already encountered in strings, but instead of the items being labeled according to their numerical positions, they can be labeled pretty much any old way we choose. For example …

mydictionary = {“galahad”:”pure”,”lancelot”:”brave”}

Now if you type

print mydictionary[“lancelot”]

you get

‘brave’

and so on. Dictionaries are composed of ‘key:value’ pairs and the stored value for each key can be accessed using the mydictionary[key] syntax. Dictionaries can also be added to like this

mydictionary[“bedevere”] = “wise”

So now let’s create a dictionary to store the molecular weights of the 4 standard DNA nucleotides

nucleotides = {‘a’:131.2,’t’:304.2,’c’:289.2,’g’:329.2}

Note that we used single quotes here. Python allows both single and double quotes to be used interchangeably but the closing quote must be the same as the opening quote. What this dictionary does in effect, is to translate the symbol for the nucleotide into a molecular weight, for example

print nucleotides[‘t’]

prints the value

304.2

So now we are in a position to actually do a useful calculation. By combing the ‘for’ iterator for our sequences and the nucleotide dictionary, we can calculate the molecular weight of any DNA sequence. Here’s how …

molecularweight = 0.0
for ch in mysequence:
    molecularweight += nucleotides[ch]
print molecularweight

When you run this code, you get the molecular weight for our little 4 nucleotide sequence

1053.8

Now we could also do all kinds of fancy stuff like adding the molecular weight of a 5′ phosphate if one is present, or accounting for DNA adducts or chemically modified bases, but you get the general idea.

Organize your code with functions

Since we’re going to want to use the same molecular weight calculation on different sequences, let’s create a Python function that takes a DNA sequence as input and returns its molecular weight.

def calculateMolecularWeight(sequence):  
    molecularWeight = 0.0
    for ch in sequence:
        molecularWeight += nucleotides[ch]
    return molecularWeight

In Python the ‘def’ statement defines a function by its name ‘calculateMolecularWeight’ and by the parameters that it takes as input – in this case a single parameter ‘sequence’. Note that Python is a dynamically typed language that does not require you to define what kind of data ‘sequence’ is before you run it, so the calculateMolecularWeight function would happily accept an integer like 2 as the input parameter, but it would raise an error when it tried to process it since an integer has no items in it for the ‘for’ loop to iterate over. Notice also how all of the code that forms the body of the function is indented (with the code in the ‘for’ loop being indented one more level still). The ‘return’ statement in a function does two things – it exits the function and if it is accompanied by a parameter as in this case, it sets the return value of the function to that parameter.

Now that we have our function, we can calculate the molecular weight for any DNA sequence that we give it as an input parameter, like this

mwt = calculateMolecularWeight(mysequence)

or we can even supply the sequence parameter explicitly like this

mwt = calculateMolecularWeight(“gatgctgtggataa”)

Lists: The missing ingredient

So now let’s look at adding the final feature – the ability to detect restriction sites in the sequence and calculate the molecular weights of all the DNA fragments produced by a restriction digest.

We can define restriction sites as sequences in a similar way as we did for DNA sequences. For example

bamH1 = “ggatcc”
sma1 = “cccggg”

but in addition to the restriction enzyme’s recognition motif, we also need to include information about where the enzyme cuts the DNA strand within the motif, so let’s exapnd our deinition of a restriction site by using the very versatile list format that Python offers.

bamH1 = [“ggatcc”,0]
sma1 = [“cccggg”,2]

Lists can be formed by grouping items in square brackets as shown above. Python lists can contain other Python objects of any kind – even other Python lists! The second item (an integer) in our restriction site definition is the position in the motif after which the enzyme cuts the DNA (again numbering from 0 as the first position). The items in lists can be accessed by their positions, just like the characters in a string, so typing

print sma1[1]

returns the value

2

Remember that Python lists are numbered from 0, so ‘sma1[1]’ is actually the 2nd item in the list. Now that we have our restriction sites, we need a way to search for them in a DNA sequence. Fortunately there is a very handy ‘find’ method for Python strings that does exactly what we want to do. The ‘find’ method is a property of strings in Python and such a property can be accessed using the dot notation like this

position = mysequence.find(“tag”)

This looks for the next occurrence of the amber stop codon in ‘mysequence’ and returns its position in the sequence. If the search motif is not found, ‘find’ returns the value -1. Notice the phrase ‘next occurrence’ above. The ‘find’ method returns the position of the first occurrence of the search motif that it finds.  If you want to look for multiple occurrences, an additional parameter can be supplied to ‘find’ to tell it which position in the sequence to start searching from, like this

position = mysequence.find(“tag”,172)

Then for example, if you find the first stop codon at position 172, you can search again from that position to look for the next occurrence and so on. If no position parameter is supplied to ‘find’ (as in the first example), the search is always done from the beginning of the string.

The .find syntax works for any Python string  because in object-oriented programming (OOP) terms, ‘mysequence’ belongs to the general string class and ‘find’ is a method defined for that class. We will hardly delve at all into OOP in this introductory tutorial, except to say that Python is an OOP language and with a little OOP knowledge there are ways that we could write the code in this tutorial much more cleanly and elegantly in an OOP style. For instance, we could define our own DNA sequence and restriction site classes, with their own properties and methods. Fortunately for us however, unlike some other OOP languages, Python does not force the programmer to work in an OOP style all the time, so we can learn just the general nuts and bolts of Python for now and save OOP for another day.

We already have our molecular weight function, so let’s add another function for doing restriction digests of our sequences. This function will take a sequence and a restriction site definition as input and return a list of the digested DNA fragments. Here it is …

def digestDNA(sequence,rs):
    frags = []
    last = 0
    while 1:
        i = sequence.find(rs[0],last)
        if i == -1:
            frags.append(sequence[last:])
            break
        else:
            frags.append(sequence[last:i+rs[1]])
            last = i+rs[1]
    return frags

The line ‘frags = []’ creates an empty list that we will use to store our DNA fragments. The variable ‘last’ will be used to record the last place that we found the restriction site defined in ‘rs’. Initially, ‘last’ must obviously be set to 0 so that the search starts at the beginning of the sequence.

The Python ‘while’ command is another kind of iterative loop that works differently from the ‘for’ loop that we have already encountered. The ‘while’ loop will continue to cycle for as long as the condition after ‘while’ is true. It is easier to understand how the ‘while’ loop works by studying this example

x=0
while x < 10:
    print x
    x += 1

This loop will print the numbers from 0 to 9 and then stop once x reaches 10. In our function however, we don’t seem to have any condition at all after ‘while’, but just a rather mysterious integer, the number 1. The reason for this is that integers in Python can be used as shorthand for true or false where an integer greater or less than 0 is ‘true’, with 0 being ‘false’.

In our function then, the loop would seem to run forever since 1 is always true and never gets changed. If you look in the indented body of the loop however, you will see the ‘break’ command which exits the loop. The loop in our function is essentially configured to run forever, until we encounter the situataion where there are no more occurrences of the restriction site motif (in which case sequence.find() returns -1) and then we simply add the remainder of the sequence to our list of fragments and exit the loop using the ‘break’ command.

The ‘if’ construct tests whether the value returned from ‘find’ equals  -1. Notice that in the conditional statement we use ‘i == 1’ which means “return ‘true’ if i equals 1 or ‘false’ if not”, as opposed to “i=1” which means “set the value of i to 1”. If the condition following ‘if’ is true, everything in the indented ‘if’ block is done. The ‘else’ block is a way to tell Python what to do in the event that the ‘if’ condition is false. In our case, if ‘find’ does find another occurrence of the restriction site, we use the restriction site cut position to define the new fragment we’re generating, we add the new fragment to our fragment list using the ‘append’ method that is a property of  Python lists and we update the ‘last’ variable so that next search for the restriction site will continue from the position of its last occurrence. Pretty simple right?

All together now …

Now we can finally put it all together to create a Python script that will calculate the molelcular weights of the DNA fragments produced by a restriction digest. You will notice that the ‘digestDNA’ function returns a list of DNA sequences, so after we have run the function, we can use our iterative ‘for’ loop to go through the returned list of fragments one at a time and calculate the molecular weight of each one, like this

for fragment in digestDNA(s,sma1):
    print item
    print calculateMolecularWeight(item)

Notice that instead of assigning the return value of the ‘digestDNA’ function to a variable and then iterating over the variable like this

fragmentList = digestDNA(s,sma1)
for fragment in fragmentList:
    …
    …

we just used the function itself directly where the variable would go. This is perfectly OK to do in Python and in fact, it is often good form to do this to make the code more concise and readable.

So let’s create a fake sequence with a couple of SmaI restriction sites in it and test our code.

s = “atcgatcgatcgcccgggatcgatcgcccgggatcgatcgatcg”
for fragment in digestDNA(s,sma1):
    print fragment
    print calculateMolecularWeight(fragment)

If everything went well, running this code should yield the following output

atcgatcgatcgcc
3739.8
cgggatcgatcgcc
3962.8
cgggatcgatcgatcg
4438.2

If we wanted to do a multiple digest, for each additional restriction site, we could apply the same approach to the fragments generated by the previous digest(s). In this case, we could write our code in such a way that it could be applied recursively to our original sequence and the subsequent fragments. Recursive methods are easy to write in Python, but we’ll save recursion for a future tutorial.

One of the great things about Python is that it is a very popular language with a large and flourishing community. This means that for most problem domains, including the life sciences, somebody somewhere has probably already developed a library of Python functions for your particular problem. As you might have guessed, there already exist Python libraries for the kind of bioinformatics problems that we have tackled here. Since the goal here was to learn enough basic Python to be able to write some useful code, we did not rely on any of them for our simple problem. Here are some links however, to give you an idea of the kinds of life science libraries that are available in Python.

Biopython – a set of freely available tools for biological computation
Biskit – an object-oriented library for structural bioinformatics research
SciPy – a library for mathematics, science, and engineering, with many useful algorithms for life science computing

This list is by no means exhaustive, but it gives some idea of the universe of useful Python tools that are available to the life scientist.

Don’t stop there!

So now that you have hopefully seen enough to appreciate how easy it can be to write useful code, we encourage you not to stop here. We will be publishing future tutorials, but in the meantime, why not head on over and browse the many fantastic references, tutorials and other resources for new Python users that can be found at the documentation area of the Python web site.

  © The Digital Biologist | All Rights Reserved

2 thoughts on “Code Tutorial: Getting started with Python in the lab

Leave a Reply