Tutorial: Using formal grammars in computational biology

[Figure: kappa_rule_example.png - a Kappa rule describing the dephosphorylation of threonine T75 on DARPP-32, from a dopamine signaling model]

PROLOGUE

Next to the human languages with which we communicate, the equations of mathematics are the most widely used form of semantic representation in the sciences - yet as powerful as they are, traditional mathematical approaches have a number of fundamental attributes that hinder their use as an effective means to represent complex biological systems. The development of new intellectual frameworks, beyond the purely mathematical approaches that have been borrowed from other fields, will, I believe, be an essential factor in the advancement of life science research in the coming years. Formal representations that are capable of handling the massive complexity of living systems, and that utilize more biologically intuitive and transparent idioms, will enable life scientists to better organize the huge troves of data that advances in laboratory instrumentation and automation have generated. Most importantly, they will enable the life scientist to reason about these systems at a scale and level of detail that is commensurate with their complexity, transforming the data gathered from them into real knowledge and insight. In this respect, I foresee an ever-increasing role for computer science in life science research that will certainly necessitate a similar shift in the broader life science curriculum in schools and colleges.

This tutorial, intended for life scientists with little or no experience in computational modeling, introduces the use of non-mathematical, formal grammars as a means to represent biological systems on computers. As such, it follows on directly from my previous tutorial How to build an agent-based model of a cell signaling network, which describes the actual use of a formal grammar for modeling cell signaling pathways. In this tutorial, we will look “under the hood” in more detail, and examine how formal grammars can be designed and used as vehicles to represent biological systems computationally.

Computers handle abstractions at many levels. At a fundamental level, everything your computer can manage, from a simple integer to an epic novel of love and betrayal, is built from layers of abstraction on top of the ones and zeros that are the substrate of everything in the digital world. When early computer engineers got frustrated with the onerous task of trying to build operating systems and software using the computer’s own idiom of binary and hexadecimal numbers, they invented programming languages as abstraction layers that allow humans to work with computers in an idiom that is more - well, “human”. As programming languages themselves evolved, so did the abstractions through which the human operator worked with the computer. Along the way from the earliest down-at-the-metal assembly languages that closely mirror the internal structure of the computer itself, through procedural languages like C and Fortran, our current digital world has evolved into a vast ecosphere of many layered, high-level abstractions. With object-oriented programming (OOP) for example, we can develop algorithms based upon a framework of self-contained objects that interact with one another, with features like inheritance and polymorphism that were designed to help us better organize our code, make it more readable, and (hopefully) allow us to write less of it. (Although I use OOP a lot in my own work and personally find it a very productive and intuitive programming paradigm, I think it’s fair to say that in certain areas there is something of a shortfall between the promise of OOP and its actual delivery - but the point of this article is not to start a flame war about which programming paradigm is best.) It is also worth mentioning in passing that the internet itself is built upon layers and layers of abstraction, from the shared network protocols that allow computers to exchange data, to the sophisticated rendering and styling of the web pages that you view in your browser.

If we wish to “do” biology on computers, we face the same kind of challenges that the designers of programming languages have had to grapple with for decades. Biologists don’t generally think about life in terms of abstractions involving ones and zeros, but if they wish to model biological processes on a computer, they must represent the concepts of biology in a form that a computer can recognize - but also in an idiom that is appropriate for biology and familiar enough to the biologist that it doesn’t feel like having to work in some kind of alien machine language. Since a computer is, at its heart, an algorithmic engine for handling logic and mathematics, it is a platform that lends itself naturally to the representation of concepts in fields like physics, in which the world is represented in equations, and math and logic are the lingua franca amongst its practitioners. Computational modeling is not yet in the mainstream of life science research in the way that it is in other fields like physics, but what relatively little biological modeling there is largely employs modeling paradigms borrowed from other fields (like physics) where modeling is more mature. This creates a number of problems, not least of which is that these methods tend to be predominantly built upon mathematical approaches (such as the use of ordinary differential equations, ODEs) that do not lend themselves well to the kinds of questions to which a biologist would seek answers. The problems with these mathematical approaches include (but are not limited to) intractable technical limitations on their ability to adequately represent the vast complexity of biological systems, as well as the often underestimated cognitive disconnect between the mathematical idioms that these modeling approaches force the biologist to work in, and the idioms that biologists actually use. This disconnect affects not only the intellectual frameworks within which biologists reason about the systems that they study, but also the ways in which biologists communicate concepts and ideas about these systems to other biologists.

Given the current situation in which most of the approaches used in biological modeling have been borrowed from other fields, it should come as no surprise that many of the current practitioners of biological modeling are actually mathematicians and computer scientists whose interest in biology has led them to make the leap into life science research. On the one hand, this is an extremely fortuitous development for the life science field, since it brings a wealth of new expertise, ideas, and approaches to biology that are sorely needed. The current life science curriculum in most colleges still does not include the kind of math and computer science training that biologists would need to be able to develop such approaches themselves. On the other hand, as we have already mentioned, these approaches borrowed from other fields tend not to be well-suited to the study of complex biological systems, and furthermore, using them competently often requires the kind of training and expertise that these other fields provide, but that is still mostly lacking in the life science curriculum.

In a nutshell, the problem is this: many biologists have questions about the systems they study, for which good models of those systems could furnish new hypotheses and significant insight into how those systems work. Most biological modeling approaches, however, are severely limited by the traditional, mathematical foundations upon which they are built. Moreover, most biologists are about as thrilled at the prospect of becoming mathematicians in order to use computational models, as the average person would be about becoming a mechanic in order to drive a car.

THE KAPPA MODELING LANGUAGE

A few years ago, I had the good fortune to be hired by a startup, newly funded by two West Coast venture capital firms, with the goal of developing a cloud-based, collaborative modeling platform tailored specifically to the needs of biologists studying complex living systems. Underpinning this platform was a biological programming language called Kappa. Like a regular computer programming language, Kappa has a semantic formalism that gives it a great deal of analytical power, while also imposing a kind of monastic discipline on the coder in terms of syntax and structure. Kappa has a very sparse and readable syntax that is capable of representing the five distinct processes shown in the diagram below - yet despite this simplicity, it is able to effectively model almost any kind of biological process that you could observe in a living cell. You can see a real-world example of a Kappa rule in the figure at the top of this article, which shows a unidirectional Kappa rule describing a protein-protein interaction that occurs in the dopamine signaling pathway - the dephosphorylation of a critical threonine residue (T75) on the dopamine- and cAMP-regulated neuronal phosphoprotein DARPP-32. Here are the five different types of Kappa rule:

[Figure: simple_kappa_examples.002.png - the five types of Kappa rule: synthesis, degradation, binding, unbinding, and modification]

Kappa is a rule-based language that represents biological processes in the form of rules that define the interactions of the agents to which they are applied. Every Kappa rule expresses a reaction with a left- and a right-hand side, and describes a unidirectional or reversible transformation in which one or more of these synthesis, degradation, binding, unbinding, and modification processes occur. Kappa solves some of the major problems that beset the modeling approaches borrowed from other fields:

  • it represents biological processes in an idiom that is familiar to biologists (binding, unbinding, modification etc.)

  • it can be used to assemble models of even the most complex biological systems, since the possible states of the system do not need to be individually enumerated ahead of time, as they must be using traditional mathematical approaches such as ODEs

  • it is as human-readable and transparent as traditional mathematical models tend to be arcane and opaque

  • its semantic formalism affords the coder incredible power to do model introspection and analysis in ways that would not be possible using traditional mathematical approaches

Looking at the code underneath each process diagram, you might be tempted to think that despite this being a modeling platform for biology, it all still looks very much like computer science - and you would be largely correct in this assumption. There is simply no getting around the fact that some kind of detour into the realms that border computer science and yes, even mathematics, is necessary if you want to develop new computational approaches for handling the massive static and dynamic complexity of biological systems. In fact, I predict that biologists are, at the very least, going to have to meet computer scientists and mathematicians halfway by acquiring some of their training and expertise in the course of their own life science education. It’s actually hard for me to imagine that the life science field will not be drawn inexorably in this direction in the coming years - especially given the incredible unmet need that I perceive in the life sciences for these kinds of methodologies.

All that said, however, even the Kappa language is very much closer to the idioms of biology than a set of ODEs, and it lends itself very well to serving as the foundation for higher layers of abstraction, such as the visual rule builder that you see in the figure at the top of the article. In fact, I see a useful analogy here between Kappa and Adobe’s page description language PostScript. PostScript, itself a formal grammar, is a language that was designed primarily to tell printers how to lay out printed pages of text and images. Close-up, it looks like a very dense and arcane programming language, and probably nobody would ever want to use PostScript directly for creating documents - which is why we have word processors. The word processor provides the friendly what-you-see-is-what-you-get (WYSIWYG) human interface that shows the writer how the document will look, in real time as they type. The writer doesn’t have to fret about the minutiae of how to make a line of characters align exactly to the page margins, or what the correct offset from the left-hand side of the page should be for a centered title - it’s PostScript that takes care of all this stuff under the hood. In an analogous way, one could consider a language like Kappa to be capable of serving as a kind of PostScript for biology - in certain use cases, it could be a powerful formalism that underpins more user-friendly abstraction layers to support the biological modeler. If you’re wondering what this might look like, for a part of my career I actually participated in the development of a collaborative, cloud-based platform for building Kappa models visually, and you can see a little video demo of it here.

FORMAL GRAMMARS

I have described the use of Kappa for building biological models at length in previous articles, so I don’t want to spend too much time here talking about Kappa modeling per se. We use Kappa as our primary tool for the stochastic, agent-based modeling of biological systems that we do for our consulting clients, so we are obviously fans of this approach. Rather, in this article, I would like to talk about the use of formal grammars in computational biology, and to use Kappa as an example of the power of formal grammars for representing biological systems on computers.

So what are formal grammars, and why, as a biologist, should I care about them?

This actually brings us full circle, back to our initial discussion about abstractions, since with Kappa, we are using a formal grammar as an abstraction for representing biological systems to computers. Formal grammars, and their cousins the ontologies, are powerful tools for developing computational abstractions, and they are the foundation of many of the applications and software tools used in a wide range of areas, including biology. So what is a formal grammar? I’ll let our good friends at Wikipedia field that question …

In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) is a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form.

If you examine the Kappa rule at the top of this article, you can get a good sense of what this definition is saying with regard to formal grammars dealing only with the form of strings and not their meaning. The critical threonine site T75 on DARPP-32 has the state label p in the left hand side of the rule and u in the right. In the description of the rule, we say that it represents the dephosphorylation of T75, but in Kappa terms, we could have used any labels we wanted to signify these states. The only thing that the Kappa rule truly represents is the change of this label in going from the left to the right hand side of the rule. There’s no sense whatsoever in which any “meaning” for these state labels is implied or represented in the Kappa grammar. These simple labels mean only whatever the modeler wants them to mean, and it is in fact only the modeler who provides any meaning that might pertain to the model at all. It’s important to understand, then, that the grammar is, in itself, nothing more than a set of rules for the production of strings. The following, for example, is as valid a Kappa string as any you will see in this article, and it demonstrates very nicely how easy it would be to use Kappa for something entirely outside of biology.

TrafficLight(color{red}), Car(status{stopped}) -> TrafficLight(color{green}), Car(status{moving})

In other words, although Kappa was designed to be used for modeling biological processes, there is no inherent meaning in its grammar that is specific to biology (or anything else for that matter).

In the course of our consulting work, we have designed and used formal grammars in a variety of life science research areas, including the agent-based modeling of cell signaling pathways, the design and synthesis of novel nucleotides and their oligomers, and the hardware configuration of diagnostic laboratory robots, to name but a few. The ability to design and implement formal grammars is especially useful where, for example, you need a domain-specific scripting or control language, or computational support for an ontology or system of classification. The aim of this article, then, is to provide a very brief introduction to formal grammars and their use in biology, using the Kappa language as an example.

Probably the easiest way to demonstrate the formal grammar behind Kappa is to look at a typical use case in which a biological system is modeled using Kappa. We won’t consider synthesis and degradation rules in our example, which are in any case fairly self-explanatory, and we will focus only on the binding, unbinding and modification rules. In the process of learning the grammar for these rules, we will also end up learning pretty much everything we would need to understand the synthesis and degradation rules. In the example we develop here, we’ll look at one of the simplest possible biological processes - a kinase binds its substrate, phosphorylates it, and then unbinds its phosphorylated substrate.

Here’s what a visual representation of this biological process looks like, in the idiom in which it has been modeled using Kappa.

[Figure: simple_kappa_examples.001.png - a visual representation of the three-step kinase/substrate example: binding, modification, unbinding]

The “players” in the biological system, usually referred to as agents in this kind of modeling, are typically proteins and/or other molecules in the cell. In this example, there are two agents, K (the kinase) and S (the substrate). Agents in Kappa have sites, represented in the diagram by the blue circles. Each site has a unique (for the agent) label that distinguishes it from other sites on the same agent. Different agents can have the same site names, but these are still distinct since each site is referenced with respect to the agent it belongs to. In computer science terminology, this is referred to as scope. The scope of a name is the context in which it is valid - in other words, we could say that the site names must be unique within the scope of an agent. The sites themselves have a twofold “purpose” in the use of Kappa for modeling biological systems - they serve as binding sites between agents (and sometimes even within the same agent), and they also have states that can be modified, as represented by the labeled purple circles on the sites. A state label on a site is no more than that - a label - and it can mean whatever the modeler wishes it to mean. You might also have been able to guess from the figure above that sites must always have unique labels within an agent, but whether or not they have labeled states or are bound to anything is optional. More on this in a minute.

Before we move on to the grammar in detail, you will notice that a Kappa rule is essentially a transformation from the ensemble of states on the left hand side of the rule, to those on the right. A binding between two sites is shown in the visual representation as a line connecting the two bound sites. All three of the rules in our simple example are unidirectional, but Kappa rules can also be (and often are) reversible - in other words, the transformation in a reversible rule can also occur in the opposite direction, starting with the ensemble on the right and ending with the ensemble on the left. In our kinase example, this means that we could have written the separate binding and unbinding rules as one reversible rule, but we chose not to, just for the purposes of making the Kappa implementation of this three-step process easy to follow - see the hypothetical example below.
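Jumping ahead a little (the syntax will be explained in the sections that follow), a hypothetical reversible version of the binding/unbinding step might look like this:

K(x[.]), S(a{u}[.], b[.]) <-> K(x[1]), S(a{u}[1], b[.])

Note that, written this way, the reverse (unbinding) direction would only apply while site a on the substrate still matched the u state shown on the right hand side - one more reason to keep the binding and unbinding rules separate in our example.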

So let’s look at the syntax of Kappa in more detail, but still informally - by which I mean we’re just going to describe the syntax in regular terms for now, because we are not yet ready to look at how this can be defined in a rigorous way. So at this stage, I guess you could call it a kind of “informal grammar”. If you look at the Kappa rule strings underneath each diagram, you can follow what is going on.

An agent in Kappa syntax (one of the “players” in the model) is a name, followed by a set of parentheses that contain the sites, if there are any - a Kappa agent does not need to have any sites. So the following are all valid Kappa agent definitions:

K(x), S(a, b), Calcium(), C3b(x, y, z), H2O(), Hb+Fe(o1, o2, o3, o4)

But there are constraints on the characters that a name can contain - so for example, you will get an error if you try to define any of the following agents:

3Cb(), Kinase/Cleaved(), Hb*Fe()

This is because agent names cannot start with numbers, and cannot contain certain symbols such as ‘/’ and ‘*’, which are used in other specific contexts within the Kappa syntax.

Site names in Kappa are similarly constrained, and in addition to being unique within the agent, each site must appear within the agent’s parentheses, in a comma-separated list. Sites can also have two additional features that can be seen in the example figure. Each site can optionally have a state, represented by a state label inside curly brackets like this {p}, and a binding index that it shares with another site in the ensemble to which it is bound (usually but not always on a different agent). This binding is represented by an integer index inside a set of square brackets like this [2]. In the same way that site names must be unique within the scope of an agent, the binding index pairs that link two sites must be unique within the ensemble of agents being described, to ensure that there is no ambiguity about who is binding whom. In the figure at the top of the article for example, you can see that the index 1 is used for the binding between the site bind on DARPP-32 and the site bind on PP2A, while the index 2 is used for the binding between the site bind on Calcium and the site Calcium on PP2A. There is also a way to define a site as unbound, by using a period inside the square brackets instead of an integer, like this: [.]. It should be noted that state labels are also somewhat constrained in their format, similarly to agent and site names.

With all of this in mind, here are some valid agent definitions in Kappa syntax.

K(), K(x), S(a, b), K(x[2]), S(a{u}, b{p}[2])

As we will see very soon, some of the “rules” that we have informally defined here for the Kappa syntax, will be enshrined in the formal grammar that we develop to rigorously define it, while some will not be, since they are not purely “grammatical” features. Incidentally, since Kappa is itself composed of rules, the rules that we will develop to define its syntax might be considered “metarules”, or “rules for rules”.

Based upon this introduction to a small subset of the Kappa syntax, we are now in a position to figure out what is going on in the simple biological system that we are modeling using Kappa. In the initial binding rule, we see that the change from the left to the right hand side of the rule is the binding of the kinase (K) site x to the unphosphorylated (u) site a on the substrate (S), as represented by the binding index 1, which is now shared between K(x) and S(a) in the right hand side of the rule to denote the binding between them. In the second (modification) rule, we see the change in state of site a on the substrate (S) from u to p - you will recall from our earlier discussion that these labels have no inherent meaning in Kappa, but we as the modelers have chosen to use u and p to represent the unphosphorylated and phosphorylated states of site a, respectively. The third (unbinding) rule is the reverse of the first (binding) rule, with the shared binding index 1 being removed from K(x) and S(a). Again, it is useful to refer to the visual representation above each Kappa rule, to get a better sense of what process the rule is representing.
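For reference, here are the three Kappa rule strings from the diagram - the same strings that we will feed to our parser at the end of this article:

K(x[.]), S(a{u}[.], b[.]) -> K(x[1]), S(a{u}[1], b[.])
K(x[1]), S(a{u}[1], b[.]) -> K(x[1]), S(a{p}[1], b[.])
K(x[1]), S(a[1], b[.]) -> K(x[.]), S(a[.], b[.])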

You will also notice that although the site b on the substrate is not directly involved in the binding, modification, and unbinding interactions, it is defined as being in an unbound state [.], and therefore the rules that define these interactions will only be applied to the agent ensemble shown in the diagram when the substrate site b is unbound. This might be, for example, because the modeler wishes to represent a situation in which the substrate sites a and b are close together and are sterically only able to bind the kinase one at a time, or just because the modeler prefers to write separate rules for handling these interactions differently when the substrate b site is in a bound state.

Before we get to the business of rigorously defining a grammar for Kappa, we should also note that a rule itself has an overall structure that builds upon the syntax that we have already described. A rule is composed of a left and right hand side, each of which contains a list of agents (defined as we have described) separated by commas. The left and right hand sides of the rule are separated by a single-headed arrow symbol ‘->’ for unidirectional rules, or a double-headed arrow symbol ‘<->’ for bidirectional rules. This is by no means an exhaustive survey of the Kappa syntax, but it is enough to serve as the foundation for an example of how to develop a formal grammar.

So the question now is - how do we rigorously define this set of syntax rules that we have informally defined, and how do we represent it computationally in such a way that a Kappa rule string could be parsed and analyzed by a computer?

PARSING AND METALANGUAGES

Parsing has a more generalized meaning in everyday language, but with regard to formal grammars in a computer science context, parsing is something quite specific - once again, we can turn to our good friends at Wikipedia for help:

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar.

A parser in computer science is a piece of software that essentially deconstructs and analyzes a string of symbols and characters according to the rules of a formal grammar. Computer programming languages like Python and C must be parsed before they can be translated into machine instructions that can be run by the computer. A web page must be parsed before it can be rendered in a web browser. Parsers are ubiquitous, and one of the most important functions of a parser is to identify strings that are invalid because they do not conform to the production rules of the formal grammar being used. In Kappa for example, this might be an agent with two sites that are not separated by a comma within the parentheses that follow the agent’s name.
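A hypothetical example of just such an invalid string, in which the comma between the agent’s two sites is missing, would be:

K(x[1] a{u})

A parser built from the Kappa grammar would reject this string as malformed.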

So how might we approach parsing a Kappa rule?

Consider this top-down approach to the problem …

  • A Kappa rule consists of left and right hand sides, separated by an arrow to show the rule direction.

  • The left hand and right hand sides of the rule are lists of agents and the arrow is a single or double-headed arrow (depending upon whether the rule is unidirectional or bidirectional)

  • The lists of agents in the left and right hand sides are Kappa agent definitions separated by commas

  • A Kappa agent definition is the agent’s name, followed by a list of the agent’s sites enclosed within parentheses, and separated by commas

  • A site is a name, optionally followed by a state definition and/or a binding definition

  • A state definition is a set of curly brackets containing a state label

  • A binding definition is a set of square brackets containing either an integer or a period

What’s going on here is essentially a top-down deconstruction of the Kappa rule using a recursive algorithm that goes something like this:

“At the highest level, a rule consists of components A, B and C - A consists of components D and E - D consists of components F, G and H - and so on.”

Starting at the highest level, we progressively deconstruct the components of the rule string into smaller and smaller pieces, until we have gone as far as we can go - for example, when we get down to the level of the “indivisible atoms” of the rule string, such as the actual characters themselves. In the Kappa syntax, this happens, for example, when we get down to the description of the first character in an agent name, which must be a letter and not a number. In fact, even the agent name itself must be constructed from at least two elements, due to the requirement that the first character cannot be a number - but numbers can appear anywhere after the first character in the agent name string, so this requires us to define separate syntaxes for the first character of the agent name, and for the second and subsequent characters. If we were to informally describe the syntax rules for an agent name, it might look something like this:

agent_name = first character from lists a-z and A-Z + any number of subsequent characters from lists a-z, A-Z and 0-9

The line above is what computer scientists refer to as pseudocode - an informal, high-level description of an algorithm that lacks the semantic formalism of actual code. If we want to provide instructions to a computer for parsing the formal grammar of Kappa, what we need is something akin to this kind of description, but formal enough that we can actually implement it as code. It should not have escaped the reader’s attention here that the situation we’re describing is one in which we are using one formal grammar to define another. Since the Kappa language is itself composed of rules, what we need are “metarules” that define the rules by which Kappa rules are written. In computer science or linguistic terms, this is a metalanguage with which we can define languages. Fortunately for us, just such a metalanguage already exists, and parsers based upon it have been implemented in almost all of the most popular computer programming languages, including our favorite, Python.

A FORMAL GRAMMAR FOR THE KAPPA LANGUAGE IN BNF

Developed by the computer scientists John Backus and Peter Naur in the late 1950s, Backus-Naur Form (BNF) is a metalanguage that can be used to define certain types of formal grammars known as context-free grammars. A context-free grammar is a set of string production rules, which can be applied to any nonterminal symbol, regardless of context, to produce the set of strings that the grammar is capable of generating. A nonterminal symbol is a lexical element in the grammar that does not appear in the final language, whereas a terminal symbol is an element (symbol) of the final language - like a lowercase letter or a number for example. All of the rules in a context-free grammar are applied in a one-to-one, one-to-many, or one-to-none fashion, and the left-hand side of a rule must always be a single nonterminal symbol. If all of this sounds a little complicated, it will be easier to understand when we see it in actual use, so here is the first section of the BNF code that defines the grammar for Kappa.

We will start with the BNF for a Kappa agent name:

<upper_case> ::= "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
<lower_case> ::= "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
<digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "0"
<allowed_symbol> ::= "+" | "-" | "_" | ":"
<letter> ::= <upper_case> | <lower_case>
<name_character> ::= <letter> | <digit> | <allowed_symbol>
<name> ::= <letter> | <name><name_character>

Each BNF statement uses the ::= token to define the entity named on the left hand side, according to the lexical structure described on the right. The upright bar symbol | denotes “or”, so we can see that a <digit> can be any one of the integers from 0 to 9. Similarly, the entities <upper_case> and <lower_case> can be any one of the uppercase or lowercase letters respectively, listed on the right hand side of their definitions. If we now look at <letter>, we see that the named entities that we create in BNF can also appear in the definitions - in this case, we define <letter> to be any one of the characters defined in <upper_case> and <lower_case>. As an aside - yes, I could have defined a longer, single list of uppercase and lowercase characters, effectively combining <upper_case> and <lower_case> into one named entity, but I chose to use two entities because it makes the BNF more readable, and it gives me additional flexibility - for example, if I ever need to use only uppercase or only lowercase letters in a subsequent definition. Just before we get to the definition of an agent name itself, we define <name_character>, which determines the complete set of characters that are allowed to be used after the first position in an agent name. This includes all uppercase and lowercase letters, the set of integers from 0 to 9, and a small number of special characters that are also allowed to appear in agent names. It is in the definition of <name> however, that we start to see the real power of BNF, for here we see a recursive reference to the entity we are defining, within its own definition. The way to read this definition is that an agent name can consist of either a letter, or an already existing agent name followed by another character from the set of characters that are permitted in agent names.

This might seem strange at first, but let’s consider it in the context of the production rules for the formal grammar that we are defining. As we start assembling our agent name, it is clear from the definition that we need to begin with a letter, since the only alternative given in the definition is to use <name><name_character> - in other words, the agent name we have assembled so far, followed by another allowed character. Once we have a letter in the first position, we then have a <name>, and from here on in, we can assemble the agent name by iteratively adding a <name_character> to it. If we try to start the agent name with a number, we are already in violation of the grammar.
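To see these production rules in action, here is a step-by-step derivation of the valid agent name C3b from our earlier examples:

<name> → <letter>                    produces "C"
<name> → <name><name_character>      produces "C3"
<name> → <name><name_character>      produces "C3b"

For the invalid name 3Cb, on the other hand, we are stuck before we even start - there is no production in the grammar that allows a <digit> to occupy the first position of a <name>.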

The BNF above may look like a lot of code just to define what letters and digits are, but starting from scratch with a new grammar always requires a fair bit of legwork in the beginning, defining all of the basic elements from which the more complex elements are built - the effort pays off later, however, as we continue to build the grammar on these foundational elements. For example, we will reuse the grammar for the <name> element to define site names as well.

Now that we have a grammar that defines the structure of agent (and site) names, we still need to define the full grammar for a site, before we can build up to defining the grammar for a complete agent with anything from zero to multiple sites. You will recall that Kappa sites have one required feature, the name - and two optional features - a binding status and a state. A Kappa site in its various forms, unbound and bound, stateless and state-defined etc., can be represented as follows:

Sh2   Sh2[.]   Sh2[1]   Sh2{p}   Sh2[.]{p}   Sh2[1]{p}

Kappa actually has a “don’t write, don’t care” syntax that allows the modeler to omit the binding status or the state of the site (or even the entire site) completely, in an agent description. This essentially means that we don’t care about these conditions in the context of the present rule, and that they will not be checked when an agent is being screened to see if it matches the patterns described in the rule. There are other syntactical ways in Kappa for explicitly declaring less-constrained or unconstrained binding states for sites, but for the purposes of this introduction to formal grammars, we will not be covering them within the subset of the Kappa syntax that we are presenting here. Returning to the site grammar now - since we already have the grammar for the site name (we’ll just use <name> again), let’s develop the grammar for the (optional) binding notation.

<dot> ::= "."
<open_sq_br> ::= "["
<close_sq_br> ::= "]"
<binding_index> ::= <digit> | <binding_index><digit>
<binding_status> ::= <dot> | <binding_index>
<binding> ::= <open_sq_br><binding_status><close_sq_br>

Since the binding status of a site can be represented using either a dot for the unbound state, or a binding index, we have defined a <binding_status> element that can be either one of these. In the <binding_index> element, we have used the same kind of recursive description that we used to build the <name> grammar, defining the binding index as either a digit, or a binding index followed by another digit. This allows the <binding_index> element to consist of anything from one to multiple digits, although in practice, any given Kappa rule is very unlikely ever to contain enough binding descriptions that it would require the use of binding indices with more than one digit - but hey, you never know! Finally, we put it all together to describe the overall binding status of the site in <binding>, as a set of square brackets containing the binding status in the form of a dot for the unbound form, or an integer index for the bound form.
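As a brief preview of how BNF like this translates into working code (using the Python package PyParsing, about which we will say more at the end of this article), here is a minimal sketch of the binding grammar - the element names are mine, and this is just one of several reasonable ways to write it:

from pyparsing import Literal, Suppress, Word, nums

# <binding_index> : one or more digits (PyParsing's Word expresses the BNF recursion as iteration)
binding_index = Word(nums)
# <binding_status> : a dot for the unbound form, or a binding index for the bound form
binding_status = Literal(".") | binding_index
# <binding> : the binding status enclosed in square brackets (suppressed from the output)
binding = Suppress("[") + binding_status + Suppress("]")

print(binding.parseString("[.]"))   # prints ['.']
print(binding.parseString("[42]"))  # prints ['42']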

Now we will define the grammar for the site’s state.

<open_curly_br> ::= "{"
<close_curly_br> ::= "}"
<state_label> ::= <name_character> | <state_label><name_character>
<state> ::= <open_curly_br><state_label><close_curly_br>

This grammar for the state of the site looks very similar to the grammar for the binding. To denote a state, we use curly brackets instead of square brackets, and the brackets contain a state label rather than a binding status, but the idea is very much the same. The <state_label> does not have the same requirement to have a letter in the first position as an agent or site name, but it does have to be composed of the same set of permitted characters.

Now that we have grammar definitions for the site’s name, binding status and state, we are ready to put it all together and define an overall grammar for a Kappa site. Remember, the binding and state notations are optional, so we need to be able to handle a site that is presented in name only, or with either one or both of the binding and state declarations, as shown previously:

<site_status> ::= <binding> | <state> | <binding><state> | <state><binding> | ""
<site> ::= <name><site_status>

Since the Kappa syntax allows either the binding or the state declarations to be omitted entirely, or included together in any order, we have defined a <site_status> element that describes all five possible combinations, including an empty string ("") to signify nothing.
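Continuing our PyParsing preview, here is a minimal, self-contained sketch of the site grammar (again, the element names are my own). Note how PyParsing's Each operator (&) neatly captures the "all optional, any order" behavior of <site_status>:

from pyparsing import Literal, Optional, Suppress, Word, alphas, alphanums, nums

name_chars = alphanums + "+-_:"
# <name> : a letter, followed by any number of permitted name characters
name = Word(alphas, name_chars)
# <binding> and <state>, as defined in the BNF above
binding = Suppress("[") + (Literal(".") | Word(nums)) + Suppress("]")
state = Suppress("{") + Word(name_chars) + Suppress("}")
# <site> : a name, optionally followed by a binding and/or a state, in either order
site = name + (Optional(binding) & Optional(state))

print(site.parseString("Sh2[1]{p}"))  # prints ['Sh2', '1', 'p']
print(site.parseString("Sh2"))        # a bare site name also parses: ['Sh2']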

Now we’re ready to define the grammar for a complete agent and its sites.

One situation that we have not yet encountered is the use of lists in our grammar. Agents have comma-separated lists of sites within the parentheses that follow their names. We can see this clearly in the Kappa rule example shown at the top of this article, which was taken from a dopamine signaling model. Here’s one of the Kappa agent descriptions for the DARPP-32 protein:

DARPP-32(T34{p}, T75{p}, S137{p}, bind[1])

The Kappa grammar allows for an agent to be represented with an arbitrary number of sites from zero to many, which means that our grammar will need to be able to represent an agent with empty parentheses, a single site, or a comma-separated list of multiple sites. We can see that a non-empty site list always starts with a site, and that the list can be extended by adding another comma/site pair. With this in mind, we can use the same kind of recursive description trick that we used to build up a <name> string. Here then, is the grammar that describes a complete agent, and is able to represent an agent with anything from zero to multiple sites:

<comma> ::= ","
<site_list> ::= <site> | <site_list><comma><site>
<agent_sites> ::= <site_list> | ""
<open_paren> ::= "("
<close_paren> ::= ")"
<agent> ::= <name><open_paren><agent_sites><close_paren>

Now that we have a grammar for representing Kappa agents, we’re ready to progress upwards to the level of an actual rule - remember, the goal of working through this grammar development exercise was to be able to parse the simple kinase binding, modification and unbinding rules that we show in the diagram above. With that in mind, let’s consider the syntax for a rule.

A Kappa rule has essentially three features - a left hand side agent pattern, a single-headed or double-headed arrow that indicates whether the rule is unidirectional or bidirectional, and a right hand side agent pattern. In the diagram of the five Kappa rule types earlier in this article, you can see that the left hand side agent pattern is empty in a synthesis rule, and the right hand side agent pattern is empty in a degradation rule. While we do not need to handle synthesis or degradation processes in our simple three-step example, we will also design our rule grammar to handle these cases - because why not!

Analogous to the comma-separated list of sites that we handled in the agent grammar, either side of a rule contains a comma-separated list of agents (that can also be empty in synthesis and degradation rules). We’ve already handled a situation like this, so the rule grammar should be pretty straightforward to write at this point.

<rule_direction> ::= "->" | "<->"
<agent_list> ::= <agent> | <agent_list><comma><agent>
<agent_pattern> ::= <agent_list> | ""
<rule> ::= <agent_pattern><rule_direction><agent_pattern>

A RECURSIVE DESCENT PARSER FOR THE KAPPA LANGUAGE

At this point, we have developed a complete grammar that handles the small subset of Kappa syntax that we need to represent our simple three-step kinase/substrate model. If we now step back and consider this grammar from the top down, i.e. from the rule level down, we can see that what we have created is a grammar that describes a progressive deconstruction of the Kappa syntax, in which each grammatical element is successively deconstructed into its component elements, until at last we arrive at some indivisible, foundational element that we cannot break down any further - the terminal symbols of the grammar, like the letter “K” for example. If we look at the last BNF declaration that describes the structure of a rule, we can, for example, break down <agent_pattern> into a list of <agent> elements, and each <agent> can be broken down into a <name> and its list of <site> elements, and each <site> can be broken down into the site’s <name> and its binding and state declarations, and so on, and so on …

What we have created here is a grammar that can be used for parsing strings in an approach known as recursive descent. Many of the computer languages in common use today are in fact defined by context-free grammars, and their compilers process the language syntax using recursive descent parsers - each step in the recursive descent consisting of the processing of the nonterminal symbols in the language’s grammar.

To round out this tutorial, I would like to show you the BNF-encoded grammar that we have developed here, in actual use. Based upon the BNF code from this article, I implemented this Kappa syntax grammar in our go-to programming language Python, so that I could apply this grammar to the Kappa rules that we used as an example, and demonstrate just a little of the analytical power and flexibility of this kind of formal grammar in the representation of biological systems. A full discussion of the Python code that I wrote to implement this grammar from the BNF is beyond the scope of this tutorial, and would make it far longer than I would want it to be. The subject of parsing is itself a whole area of study that spans the fields of computer science and linguistics, and even restricting the discussion of parsing to its implementation in Python would be a major undertaking. What I will say however, is that there are a number of excellent Python packages for handling formal grammars and recursive descent parsing, including PyParsing, which is the package that I used to translate the BNF in this article into Python code and create a working recursive descent parser.
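That said, to give a flavor of what the translation from BNF to PyParsing can look like, here is a condensed, minimal sketch of the full grammar from this article, applied to the binding rule from our kinase/substrate model. To be clear, this is an illustrative sketch rather than the actual implementation that produced the outputs below - the element names are my own, and a production version would add results names, error handling and so on:

from pyparsing import (Group, Literal, Optional, Suppress, Word,
                       alphas, alphanums, nums, delimitedList)

name_chars = alphanums + "+-_:"
name = Word(alphas, name_chars)                                        # <name>
binding = Suppress("[") + (Literal(".") | Word(nums)) + Suppress("]")  # <binding>
state = Suppress("{") + Word(name_chars) + Suppress("}")               # <state>
site = Group(name + (Optional(binding) & Optional(state)))             # <site>
agent_sites = Optional(delimitedList(site))                            # <agent_sites>
agent = Group(name + Suppress("(") + agent_sites + Suppress(")"))      # <agent>
agent_pattern = Group(Optional(delimitedList(agent)))                  # <agent_pattern>
rule_direction = Literal("<->") | Literal("->")                        # <rule_direction>
rule = agent_pattern + rule_direction + agent_pattern                  # <rule>

# parse the binding rule from the three-step kinase/substrate model
result = rule.parseString("K(x[.]), S(a{u}[.], b[.]) -> K(x[1]), S(a{u}[1], b[.])")
print(result)  # prints the nested structure of agents and their sites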

Here are the Kappa rules from our simple three-step kinase/substrate model, and their outputs from the Python parser that implements the BNF in this tutorial:

Kappa Rule =  "K(x[.]), S(a{u}[.], b[.]) -> K(x[1]), S(a{u}[1], b[.])"
   Rule LHS Agent Pattern:
   Agent: K
      Site: x  binding status = ., state = undefined
   Agent: S
      Site: a  binding status = ., state = u
      Site: b  binding status = ., state = undefined
-- Rule Direction: -> (unidirectional):
   Rule RHS Agent Pattern:
   Agent: K
      Site: x  binding status = 1, state = undefined
   Agent: S
      Site: a  binding status = 1, state = u
      Site: b  binding status = ., state = undefined

Kappa Rule =  "K(x[1]), S(a{u}[1], b[.]) -> K(x[1]), S(a{p}[1], b[.])"
   Rule LHS Agent Pattern:
   Agent: K
      Site: x  binding status = 1, state = undefined
   Agent: S
      Site: a  binding status = 1, state = u
      Site: b  binding status = ., state = undefined
-- Rule Direction: -> (unidirectional):
   Rule RHS Agent Pattern:
   Agent: K
      Site: x  binding status = 1, state = undefined
   Agent: S
      Site: a  binding status = 1, state = p
      Site: b  binding status = ., state = undefined

Kappa Rule =  "K(x[1]), S(a[1], b[.]) -> K(x[.]), S(a[.], b[.])"
   Rule LHS Agent Pattern:
   Agent: K
      Site: x  binding status = 1, state = undefined
   Agent: S
      Site: a  binding status = 1, state = undefined
      Site: b  binding status = ., state = undefined
-- Rule Direction: -> (unidirectional):
   Rule RHS Agent Pattern:
   Agent: K
      Site: x  binding status = ., state = undefined
   Agent: S
      Site: a  binding status = ., state = undefined
      Site: b  binding status = ., state = undefined

And for our final trick - just for good measure - we will test our parser on the more complex rule from the dopamine signaling pathway that is shown in the figure at the top of this article.

Kappa Rule =  "DARPP-32(T34{p}, T75{p}, S137{p}, bind[1]), PP2A(Calcium[2], x{p}, bind[1]), Calcium(bind[2]) -> 
               DARPP-32(T34{p}, T75{u}, S137{p}, bind[.]), PP2A(Calcium[2], x{p}, bind[.]), Calcium(bind[2])" 
   Rule LHS Agent Pattern:
   Agent: DARPP-32
      Site: T34  binding status = undefined, state = p
      Site: T75  binding status = undefined, state = p
      Site: S137  binding status = undefined, state = p
      Site: bind  binding status = 1, state = undefined
   Agent: PP2A
      Site: Calcium  binding status = 2, state = undefined
      Site: x  binding status = undefined, state = p
      Site: bind  binding status = 1, state = undefined
   Agent: Calcium
      Site: bind  binding status = 2, state = undefined
-- Rule Direction: -> (unidirectional):
   Rule RHS Agent Pattern:
   Agent: DARPP-32
      Site: T34  binding status = undefined, state = p
      Site: T75  binding status = undefined, state = u
      Site: S137  binding status = undefined, state = p
      Site: bind  binding status = ., state = undefined
   Agent: PP2A
      Site: Calcium  binding status = 2, state = undefined
      Site: x  binding status = undefined, state = p
      Site: bind  binding status = ., state = undefined
   Agent: Calcium
      Site: bind  binding status = 2, state = undefined

This last rule is so long that I split the rule string across two lines in the parser output, but you can see that even with this minimal grammar for a small subset of the Kappa syntax, we are now able to parse a “real world” Kappa rule from the dopamine signaling model.


EPILOGUE

In biology, as in many other fields, the ability to design and implement domain-specific languages for the purposes of representing and modeling the systems being studied can be tremendously useful. In biology, the (understandable) tendency of life science researchers to borrow more traditional, mathematical approaches from other fields where modeling is more mature, has certainly contributed to the relative dearth of modeling platforms that are really tailored to the needs of the biologist. This is probably also due to the lack of formal computer science training in the education and careers of most biologists. In the consulting work that we do with our life science clients, we have observed this directly. There is a significant demand amongst biologists of all stripes to be able to use computers much more in their own research, and to move beyond the spreadsheets and calculators that are still often the principal quantitative tools available to them for managing and analyzing their data. When we run our workshop course Python For Life Scientists, there are always far more scientists who sign up for it than we are able to offer places for in the 12-week session.

Biologists are starting, more and more, to embrace the idea of using computers in their research, and they should not fear that biology is in any danger of becoming a virtual science in which everything is discovered through modeling and simulation. The study of living systems will always be something that happens primarily in the laboratory and in the field, but computational modeling and simulation can be an invaluable adjunct to these empirical approaches - helping biologists to generate and test hypotheses, and design new experiments - as well as helping them to organize their data, and to reason with it. With powerful and relatively cheap computers in plentiful supply, there’s probably never been a better time for experimental biologists to consider introducing computational approaches into their own research. It is only by expanding the use of computational modeling and simulation beyond the domain of the theoretical biologists that it will become part of the mainstream in life science research, in the way that it is in other fields.

© The Digital Biologist

Gordon Webster