Life science beyond the spreadsheet


In this era when the average cell phone has more computing power than was available to an astronaut on one of the Apollo missions, the days when a scientist used a slide rule for numerical calculations are well behind us. In spite of this, however, the great majority of life scientists working in academia or in the typical biotech/pharma laboratory use little more than the slide rule's modern equivalents, the electronic calculator and the spreadsheet, for handling their research data. This is not to say that these aren't fine tools for what they do, but when you consider the amount of computing power available to a researcher in a life science laboratory, it does seem akin to someone who works at a sawmill still cutting planks with a hand saw.

At the other end of the spectrum from these tools for very general calculations is the kind of specialized enterprise software that is either purchased separately or comes bundled with laboratory instruments for managing and analyzing the data that they produce. Such software generally offers a suite of the kinds of analysis and visualization that are typically required for the data that it handles. For the most part, however, these packages are relatively hard-wired and inflexible. If scientists need, for example, their data presented in a slightly different format that better matches the specific needs of their research project, they are probably out of luck, since they generally don't have:

  1. the source code that would enable them to modify the software,

  2. the programming skills to do it even if they had the source code,

  3. the person who wrote the code to explain to them how it works,

  4. the time to do it,

  5. all of the above.

Anyone who has worked with such software has experienced the frustrations of this scenario: "No, just show me which of these 10,000 sequences have a variant of my gene in them and do not have the HindIII restriction site." The specialized software tools that are available over the internet, such as the BLAST tools for searching, retrieving and aligning biological sequence data, may relieve researchers and their IT departments of the burden of installing and maintaining the software locally. For the most part, though, they are similarly inflexible in the sense that they perform a limited set of specific "one size fits all" functions.
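
To give a feel for how little code such a request can take, here is a minimal Python sketch of that kind of filter. It is an illustration under stated assumptions rather than a finished tool: the sequence names and toy sequences are made up, and "a variant of my gene" is reduced to a hypothetical regular expression that allows either A or G at one position. The only real-world fact it relies on is that HindIII recognizes the sequence AAGCTT.

```python
import re

# Recognition sequence of the HindIII restriction enzyme. The site is
# palindromic, so checking one strand is enough for the site itself.
HINDIII_SITE = "AAGCTT"


def select_sequences(sequences, gene_pattern):
    """Return the names of sequences that match gene_pattern
    and do not contain a HindIII site.

    Only the strand as given is searched for the gene pattern.
    """
    gene_regex = re.compile(gene_pattern)
    selected = []
    for name, seq in sequences.items():
        seq = seq.upper()
        if gene_regex.search(seq) and HINDIII_SITE not in seq:
            selected.append(name)
    return selected


# Hypothetical toy data standing in for the 10,000 real sequences.
example_sequences = {
    "seq1": "GGATGCCATTGACCTAGG",      # gene variant, no HindIII site
    "seq2": "GGATGCCGTTGACCAAGCTTGG",  # gene variant, but has a HindIII site
    "seq3": "CCCCCCTTTTTTGGGG",        # neither
}

# Hypothetical definition of "a variant of my gene": A or G at one position.
hits = select_sequences(example_sequences, "ATGCC[AG]TTGACC")
print(hits)
```

With real data, the dictionary would be filled by reading a sequence file, and the pattern would be replaced by whatever definition of "variant" suits the project.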

All of this software is great if it happens to do exactly what you need it to do, in exactly the way that you want it done. It is, however, impossible for the writers of scientific software to anticipate all of the ways in which scientists will want to use it, and this is precisely where the ability to "roll your own" can be so valuable. Using a computer exclusively with the software that others have written for it may be the norm for most of the computer-connected world, but when you think about it, it's kind of like having a car that can only be driven on a very limited number of roads. The ability to define your specific task to the computer and have it process the task for you unleashes the computer's real potential and opens up a world of possibilities that are simply not available to you if you rely exclusively on pre-packaged software.

So how does this impact the life scientist?

I think it's fair to say that there is something of a disconnect between the average life scientist's exposure to computer programming and the degree to which even a modicum of experience and training in it could benefit their research. Spreadsheets can be very powerful and can be made to do quite complex tasks, but they work best for the kind of numerical data that can be easily and logically ordered into a tabular format, and that can be read from and written to files that also reflect this format. I have worked in pharma companies where the Microsoft Excel spreadsheet application is pretty much the only general computing tool used by their researchers, at almost all career levels from the most junior lab technician to the most senior executive. Furthermore, aside from the obvious examples of researchers routinely performing the same calculations over and over by hand using a calculator, I have seen many examples of problems with the handling, analysis or presentation of laboratory data for which even a simple computer script of a few lines could save the researcher hours of effort.
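
As one hedged example of a calculation that is often repeated by hand, the short Python sketch below works out dilution volumes from the familiar relation C1 x V1 = C2 x V2. The sample names, concentrations, target concentration and final volume are all hypothetical values chosen for illustration.

```python
def dilution_volumes(stock_conc, target_conc, final_volume):
    """Apply C1 * V1 = C2 * V2 and return (stock volume, diluent volume).

    Concentrations must share one unit; volumes are returned in the
    same unit as final_volume.
    """
    if stock_conc <= 0:
        raise ValueError("stock concentration must be positive")
    if target_conc > stock_conc:
        raise ValueError("target concentration exceeds stock concentration")
    stock_volume = target_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


# Hypothetical stock concentrations in ng/uL.
stocks = {"sample_A": 250.0, "sample_B": 480.0, "sample_C": 125.0}

# Dilute every sample to 50 ng/uL in a final volume of 100 uL.
for name, conc in stocks.items():
    stock_ul, diluent_ul = dilution_volumes(conc, 50.0, 100.0)
    print(f"{name}: {stock_ul:.1f} uL stock + {diluent_ul:.1f} uL diluent")
```

The same loop handles three samples or three hundred, which is where a script starts to pay for itself compared with a calculator.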

Then what are the alternatives to this situation?

There is a plethora of useful programming platforms of one sort or another available to the life scientist, many of them at no cost. Between spreadsheets and full-blown programming languages are the mathematical platforms for data analysis and visualization, of which the best known commercial examples are MATLAB and Mathematica. These platforms offer the flexibility of a mathematical scripting language within an environment that is tailor-made for numerical data analysis and visualization. Free and open source alternatives include GNU Octave and the R language for statistical computing.

The high-level, general-purpose programming languages are perhaps the most flexible solution of all, insofar as they give the user the freedom to express their problem in an almost infinite variety of idioms. Most of them also include extensive mathematical and scientific libraries that remove the need to re-invent the wheel for the more common scientific computing tasks, such as fitting an equation to experimental data or handling vectors and matrices. Some languages, such as Python and Ruby, lend themselves extremely well to the rapid scripting of solutions and, sweeter still, they are available at no cost. Just open an editor, type a few lines of code, hit run and your task is done (give or take the odd bug or typo in the code). This is not to say, however, that these languages are unsuited to creating large, complex applications that solve big problems. They are well suited to that too. I often call Python the "Swiss Army Knife of computing tools" because it is so versatile and indeed, many of the world's leading technology-driven organizations such as Google, Industrial Light & Magic and NASA use it to solve their own large and complex problems.

The Java programming language (now owned and administered by Oracle) is also widely used in the life sciences. Although it is more suited to programming larger applications than to writing quick scripts, it does offer the advantage of being significantly faster than Python or Ruby for more computationally intensive applications. The same Java code can conveniently be run on any computer platform that supports the Java environment (and most do). The rich and extensive Java programming tools are available at no cost, but the Java language has a somewhat steeper learning curve, particularly for those with no programming experience.

Given the quantity and quality of programming and scripting tools that are available, one might wonder why they are not a part of every life scientist's arsenal, yet they are not. To be clear, there is a minority of life scientists for whom computer programming is an integral part of their daily research, but at the time of writing, they are still very much a minority and not really the intended audience of this article. In my experience over the couple of decades that I have worked in the life science field since leaving graduate school, the majority of life scientists who work in laboratories in academia and industry could benefit significantly from more training in computer programming as part of their studies or ongoing work experience, yet most have little or no exposure to it.

Rightly or wrongly, these scientists may feel that programming is a skill for which their education and training have not prepared them, and one that is time-consuming to acquire. This sentiment is easy to understand given the minimal amount of computer training in most life science curricula, the incredible breadth and depth of scientific skills that a life scientist must already master to become an effective researcher and, of course, the endless pressure to publish and be productive in a fiercely competitive environment that leaves little time for excursions beyond the research at hand.

It is this author's opinion that more emphasis on computational skills, both in the life science curriculum at college and in the career training of life scientists, would benefit not only the individual researcher, but also the academic field and the life science industries for which it is the foundation. Furthermore, in keeping with my belief that computational approaches will become increasingly mainstream in biological research as they are in other fields, I see this problem becoming even more pressing in the future.

Gordon is a partner at the digital biology consulting firm Amber Biology, a Ronin Scholar and co-author of Python For The Life Sciences.

 © The Digital Biologist