Gene Ontology and Object Oriented Programming

The Gene Ontology (GO) project seeks to “standardize the representation of gene and gene product attributes across species and databases.” It does this by creating a directed acyclic graph representing the functions and relationships of specific genes. Each step down the graph, from a parent node to a child node, increases specificity. This is vaguely similar to how inheritance works in object oriented programming.

In object oriented programming, inheritance promotes code reuse without duplication. Briefly, it works like this: a base class is created containing methods and variables that other classes can inherit from it. For example, if you were modeling vehicles, you could have a Vehicle class, which contains methods like move(), stop(), etc. and variables like moveSpeed, passengers, etc. This comes in handy when you make different types of vehicles like Bikes, Cars, and Trains. Each of those classes can inherit the methods like move() and stop() as well as the variables like moveSpeed and passengers. Thus, you only have to write those methods and variables once, in the base Vehicle class. This image may be of help in visualizing it:
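In code, the Vehicle example might look something like the sketch below. This is only an illustration: the class names and the move(), stop(), moveSpeed and passengers members come from the description above, while the `carriages` attribute, the km/h unit and the returned strings are assumptions made up for the example.

```python
class Vehicle:
    """Base class: the shared variables and methods are written once, here."""

    def __init__(self, move_speed, passengers):
        self.move_speed = move_speed  # "moveSpeed" in the text; km/h is an assumption
        self.passengers = passengers
        self.moving = False

    def move(self):
        self.moving = True
        return f"{type(self).__name__} moving at {self.move_speed} km/h"

    def stop(self):
        self.moving = False
        return f"{type(self).__name__} stopped"


class Bike(Vehicle):
    pass  # inherits move(), stop() and both variables unchanged


class Train(Vehicle):
    def __init__(self, move_speed, passengers, carriages):
        super().__init__(move_speed, passengers)  # reuse the base class setup
        self.carriages = carriages  # hypothetical extra attribute for illustration


bike = Bike(move_speed=15, passengers=1)
train = Train(move_speed=120, passengers=300, carriages=8)
print(bike.move())   # Bike moving at 15 km/h
print(train.stop())  # Train stopped
```

Neither Bike nor Train defines move() or stop(); both calls resolve to the single implementation in Vehicle.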

This is a nice and tidy way of representing the world, only the world doesn’t exactly work like that. A great example is here, which explains how some of the principles of object oriented programming break down in video game programming.

The point is, not all objects can be placed in hierarchies with inheritance; sometimes you have groups that just don’t fit in anywhere and this turns the beautiful directed acyclic graph of inheritance into an ugly mess.

Back to gene ontologies: when storing information about genes in a hierarchical manner, it’s possible that not every gene inherits all the properties of the nodes above it. However, I believe that GO has gotten around that by defining different relationships between nodes. For example, in this GO term, there are several relationships between nodes: is a, has part, part of, etc.
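To make the idea of typed relationships concrete, here is a minimal sketch of a graph whose edges carry a relationship label. The term names and edges are simplified toy data chosen for illustration, not a copy of the real ontology, and the `ancestors` helper is a hypothetical function, not part of any GO tooling.

```python
# Toy ontology: each term maps to a list of (relationship, parent term) edges.
# Terms and edges are simplified for illustration, not taken from GO itself.
ONTOLOGY = {
    "mitochondrion": [("is_a", "organelle"), ("part_of", "cytoplasm")],
    "organelle": [("is_a", "cellular component")],
    "cytoplasm": [("part_of", "cell")],
    "cell": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, allowed=("is_a",)):
    """Return every term reachable from `term` via the allowed relationship types."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in ONTOLOGY[current]:
            if relation in allowed and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Following only "is_a" edges behaves like class inheritance.
print(sorted(ancestors("mitochondrion")))
# ['cellular component', 'organelle']

# Allowing "part_of" as well reaches terms that inheritance alone would miss.
print(sorted(ancestors("mitochondrion", allowed=("is_a", "part_of"))))
# ['cell', 'cellular component', 'cytoplasm', 'organelle']
```

The point of the sketch is that a node only "inherits" along the kinds of edges you choose to follow, which is more flexible than a single inheritance relationship.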

It’s possible that a similar method of inheritance could fix some broken models produced by object oriented inheritance, but I don’t know of any languages that implement this fully (I also have not looked very much).


Understanding Markov Chains

Before you start, have you read the conventions page?


A Markov chain is a mathematical model for random processes where time is taken into consideration. For each step of time (each tick of a clock), there is a probability that some event will happen. This probability depends on what the previous event was.

For example, let’s say we built a robot that has a very simple set of instructions. Every second it moves 1 meter in 1 of 2 directions (forward or backwards), and the direction it chooses depends on the last direction it chose. If the robot moved forward in the last second, it has a 75% chance of moving backwards in the next second (and vice versa). If you observe and record the robot’s movement at each time step, you might get something like this:

t | movement
0 | forward
1 | backwards
2 | forward
3 | backwards
4 | backwards


This could go on for any number of time steps, generating a sequence of events. This sequence is called a Markov chain.
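The robot can be simulated in a few lines. The sketch below assumes the 75% figure from the example is the chance of reversing direction at every step; the function name `simulate_robot` and its parameters are made up for illustration.

```python
import random

SWITCH_PROBABILITY = 0.75  # chance of reversing direction, taken from the robot example


def simulate_robot(steps, start="forward", rng=random):
    """Return the robot's moves for t = 0 .. steps - 1."""
    if steps <= 0:
        return []
    moves = [start]
    for _ in range(steps - 1):
        previous = moves[-1]
        opposite = "backwards" if previous == "forward" else "forward"
        # Only the previous move is consulted when choosing the next one.
        moves.append(opposite if rng.random() < SWITCH_PROBABILITY else previous)
    return moves


# Print a short run in the same layout as the table above.
print("t | movement")
for t, move in enumerate(simulate_robot(5, rng=random.Random(0))):
    print(t, "|", move)
```

Each run with a different seed gives a different sequence, but over many steps about three quarters of the moves reverse the previous one.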

A better definition

Mathematically, when a random process, X = (X_0, X_1, X_2, ..., X_n), follows the following property, it is a Markov chain:

P[X_{n+1} = s | X_n = s_n, X_{n-1} = s_{n-1}, ... , X_0 = s_0] = P[X_{n+1} = s | X_n = s_n]

where s and each s_i are members of S, the set containing all the possible states (the state space). Each member of X is a random variable. The above equation is stating, simply, that the future value of X (X_{n+1}) depends only on the present value of X (X_n), not on any of the values that came before it.

We can also create a diagram of the possible states and probabilities of transition in something called a state transition diagram:

This can also be described as a matrix. The same Markov chain is described below:

This matrix is called the transition matrix. Each value in the matrix must be greater than or equal to zero, and each row must sum to 1. Or in fancy latex: P_{ij} \geq 0, \sum_{j} P_{ij} = 1 for all i.
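For the robot example, the transition matrix can be written out and checked against those two conditions. This is a sketch in plain Python; the state ordering and the helper names `is_transition_matrix` and `step` are assumptions made for the example.

```python
STATES = ["forward", "backwards"]

# P[i][j] is the probability of going from STATES[i] to STATES[j].
P = [
    [0.25, 0.75],  # from forward: repeat forward, switch to backwards
    [0.75, 0.25],  # from backwards: switch to forward, repeat backwards
]


def is_transition_matrix(matrix, tolerance=1e-9):
    """Check that the matrix is square, has no negative entries, and each row sums to 1."""
    n = len(matrix)
    return all(
        len(row) == n
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tolerance
        for row in matrix
    )


def step(distribution, matrix):
    """Advance a probability distribution over the states by one time step."""
    n = len(matrix)
    return [sum(distribution[i] * matrix[i][j] for i in range(n)) for j in range(n)]


print(is_transition_matrix(P))  # True
print(step([1.0, 0.0], P))      # [0.25, 0.75]
```

Starting from a robot that certainly just moved forward, one step gives a 25% chance of forward and a 75% chance of backwards, which matches the description of the robot.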

Storing the probabilities in a matrix allows us to perform linear algebra operations on these Markov chains, which I will talk about in another blog post.


Stanford Markov Chain tutorial:

Youtube Markov Chain video:

Understanding CRISPR/Cas

With the recent New York Times article on the subject, I thought it would be a good idea to work through how CRISPR/Cas works and explain its significance in a little more detail.

What is CRISPR/Cas?

CRISPR/Cas is an immune system found in bacteria (and archaea) that stops viruses from replicating themselves.

When viruses invade cells, their goal is to replicate their DNA and create new viruses. They invade cells because they don’t have the cellular machinery (ribosomes, tRNAs, etc.) to do so themselves. To replicate themselves, they inject their DNA into a cell. The cellular machinery that replicates DNA is kind of dumb, so it will replicate any DNA that it finds. The virus is then copied many times and leaves the host cell so it can invade other cells.

In order to stop this, some bacteria have a system that can recognize viral DNA and create proteins that chop it up into pieces, stopping it from creating new viruses. This system is the CRISPR/Cas system.

How does it work?


CRISPR/Cas is a simple and, in my opinion, elegant immune system. Basically, the bacterium creates an RNA that recognizes invading virus DNA and binds to it. Then, a generalized protein finds the RNA and cuts the virus DNA. Thus, CRISPR/Cas is composed of two main parts, a specific module and a general module. CRISPR is the part that corresponds to specificity and Cas (as you may have guessed) is a more general set of genes. The main advantage of this modularity is adaptability. By separating the part that destroys DNA from the part that recognizes DNA, the bacteria can make one protein that cuts and one RNA that guides, rather than having a specialized protein that both binds and cuts DNA for every virus. When a new virus enters the cell, the bacteria can add its DNA to the CRISPR portion and drop it right into the existing immune system.

CRISPR stands for “Clustered Regularly Interspaced Short Palindromic Repeats”, which gives us an idea of the structure of the DNA. In the diagram above, there is a region labeled “repeat array.” This is made up of spacer sequences separated by repeats (the terminology is a little misleading). These spacers contain the recognition sequences the Cas proteins use to find specific DNA sequences. The spacer sequences are actually pieces of viral DNA that have been integrated into the bacterial genome. When a spacer sequence is transcribed to RNA, it can line up with the actual invading virus DNA and bind to it. Then the Cas protein can cleave the virus DNA and inactivate it.

The other portion of the repeat array is the repeats, which are palindromic. This means that once the DNA is transcribed to RNA, the RNA can base pair with itself and form the loop structure seen in the diagram above. This structure interacts with the Cas protein and forms a larger structure, which is labeled as the protein-RNA complex above.

Cas stands for “CRISPR-associated” and is named that way because when researchers discovered CRISPR, they found this group of genes near it. Some of the Cas genes are known to produce proteins that cut DNA. Their main function is to find the CRISPR RNA and cut the DNA that it’s bound to. Additionally, some Cas genes are known to add viral DNA to the CRISPR spacers, allowing more virus DNA sequences to be recognized. The way this happens isn’t well known, however.

Thus, when these genes are transcribed and translated, 3 parts come together. First you have the Cas protein, which cleaves DNA. Next you have the palindromic repeat region of the RNA that binds to the Cas protein and finally, there is the spacer region which finds the specific virus DNA and binds to it.
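The split between a specific part and a general part can be mimicked with a very rough toy model. Everything below is an assumption made for illustration: sequences are plain strings, "binding" is reduced to an exact substring match, the cut point is arbitrary, and the spacer sequences and function names are made up. Real recognition involves base pairing and other requirements that are not modeled here.

```python
# Toy model only: the spacer list plays the role of the CRISPR array and
# cas_cleave plays the role of the general-purpose Cas protein.
SPACERS = ["GATTACAGATTACA", "CCGTAAGCTTGGAC"]  # made-up spacer sequences


def find_target(dna, spacers):
    """Return (spacer, position) for the first spacer found in dna, or None."""
    for spacer in spacers:
        position = dna.find(spacer)
        if position != -1:
            return spacer, position
    return None


def cas_cleave(dna, spacers):
    """Cut dna inside the first recognized site and return the fragments."""
    hit = find_target(dna, spacers)
    if hit is None:
        return [dna]  # unrecognized DNA is left intact
    spacer, position = hit
    cut_site = position + len(spacer) // 2  # arbitrary cut point for illustration
    return [dna[:cut_site], dna[cut_site:]]


def acquire_spacer(spacers, viral_dna, length=14):
    """Return a new spacer list that also recognizes the start of viral_dna."""
    return spacers + [viral_dna[:length]]


known_virus = "TTTTGATTACAGATTACAAAAA"
new_virus = "ACGTACGTACGTACGTACGT"

print(cas_cleave(known_virus, SPACERS))  # ['TTTTGATTACA', 'GATTACAAAAA']
print(cas_cleave(new_virus, SPACERS))    # ['ACGTACGTACGTACGTACGT']

updated = acquire_spacer(SPACERS, new_virus)
print(cas_cleave(new_virus, updated))    # ['ACGTACG', 'TACGTACGTACGT']
```

Notice that `cas_cleave` never changes; recognizing a new virus only requires adding one more entry to the spacer list, which is the adaptability described earlier.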

How are people using it?

One of the exciting things about CRISPR/Cas is the way that people are using it. As the NYTimes piece above explains, it’s possible to “edit” genomes. One way to do this is to modify the Cas9 gene, which codes for the protein that cuts the DNA, so that the protein is catalytically inactive. This means that instead of cutting like it’s supposed to, it just binds to the DNA and stays there without cutting it. Then, the DNA it is bound to cannot be transcribed to RNA because there is a large protein in the way of RNA polymerase. This is the approach presented by Qi et al. in 2013, who used it for gene repression.

Another way to inactivate genes is to use the CRISPR/Cas system the way it was intended: to cut specific DNA sequences. The difference here is that we can engineer the system to bind to a sequence of our choosing rather than a sequence that is already present in the system. For example, if there is a gene we want to turn off, we can engineer the system to recognize a portion of it (20-72 base pairs). When the small guide RNA locates that sequence, Cas9 will cut the DNA at that site. Now, there are preexisting mechanisms in the cell that can repair DNA breaks like this, such as non-homologous end joining. However, this mechanism doesn’t know what the DNA should look like; all it does is stick the ends of the DNA back together, and it often loses or adds a few base pairs in the process. So when the DNA is glued back together, its information has changed. When the DNA is transcribed and translated into a protein, the protein will now contain different information and will usually be nonfunctional. This is what Niu et al. 2014 did in the paper linked to by the NYTimes piece.

Gene inactivation/repression is useful when there is a gene producing harmful transcripts and that gene is not necessary for the organism’s survival.

Genome editing is another use of CRISPR/Cas. There is a mechanism in cells called homology directed repair that takes a DNA sequence the cell considers “reputable” and copies its information into the DNA break. Thus, it’s possible to use CRISPR/Cas to create a targeted break in the DNA, introduce a gene to the cell, and have the cell use the introduced gene as a template for homology directed repair.

How is it different from similar methods?

Targeted cleavage of DNA is not something new to science. Several methods allow for the cleavage of DNA in very specific regions, such as restriction endonucleases, zinc finger nucleases, and transcription activator-like effector nucleases. As you may have guessed from the names, a nuclease is a protein that cleaves DNA.

A restriction enzyme is a protein that recognizes a very short sequence of DNA (4-8 base pairs) and cleaves the DNA at a certain spot in that recognition sequence [wiki]. These are very commonly used in biology and their activity has been well characterized.

Zinc finger nucleases are engineered proteins that have two parts [wiki]. One part cleaves DNA, just like Cas. The other part is a DNA binding protein called a zinc finger. Just like the spacer region of the CRISPR/Cas sequence, it can bind to specific sequences. The difference here is that the zinc finger is a protein, while the spacer region is an RNA.

Transcription activator-like effector nucleases (TALENs) are also engineered proteins with two parts [wiki]. Like the zinc finger nuclease and the Cas protein, one domain cleaves DNA and one domain binds to DNA. The DNA binding domains in TALENs are called transcription activator-like effectors, and they have very specific nucleotide recognition sequences.

Both zinc finger nucleases and transcription activator-like effector nucleases can be engineered to recognize specific sequences, just like CRISPR/Cas. However, both methods are susceptible to off-target cleavage if the target site is not unique in the genetic sequence of interest. One of the reasons CRISPR/Cas is exciting is that it has a longer recognition sequence, so there is less of a chance of off-target cleavage.
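To get a feel for why recognition length matters, here is some back-of-the-envelope arithmetic. It assumes a completely random sequence in which all four bases are equally likely, which real genomes are not, and it ignores the second strand. The genome length of 3 billion base pairs is just an assumed round number, the 6 bp site sits inside the 4-8 base pair range given for restriction enzymes above, and the 20 bp site is the lower end of the range given for the guide.

```python
def expected_chance_matches(genome_length, site_length):
    """Expected number of times a given site appears by chance in a random sequence."""
    positions = max(genome_length - site_length + 1, 0)  # possible start positions
    return positions / 4 ** site_length                   # each matches with probability 1 / 4^k


GENOME_LENGTH = 3_000_000_000  # assumed round number, for illustration only

for label, site_length in [("6 bp site", 6), ("20 bp site", 20)]:
    matches = expected_chance_matches(GENOME_LENGTH, site_length)
    print(f"{label}: about {matches:.3g} chance matches")
```

Under these assumptions a 6 bp site is expected to turn up by chance hundreds of thousands of times, while a specific 20 bp site is expected far less than once. Every extra base pair of recognition sequence divides the expected number of chance matches by roughly 4.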


Hopefully this post taught you something more about the CRISPR/Cas system and how it’s being used (it certainly taught me). If you find errors or things that should be clarified, please don’t hesitate to tell me!


  1. Lei S. Qi, Matthew H. Larson, Luke A. Gilbert, Jennifer A. Doudna, Jonathan S. Weissman, Adam P. Arkin, Wendell A. Lim. “Repurposing CRISPR as an RNA-Guided Platform for Sequence-Specific Control of Gene Expression.” Cell – 28 February 2013 (Vol. 152, Issue 5, pp. 1173-1183). DOI.

  2. Fedor V. Karginov, Gregory J. Hannon. “The CRISPR System: Small RNA-Guided Defense in Bacteria and Archaea.” Molecular Cell – 15 January 2010 (Vol. 37, Issue 1, pp. 7-19). DOI.