Unless you were raised by wolves, you will not be surprised to hear that the DNA of different organisms contains a kind of recipe for developing and maintaining that organism over its lifetime. Squirrel DNA has a recipe for squirrels, dog DNA for dogs, and so forth.
But wolves notwithstanding, you might be surprised to hear that there is another extremely ancient message in DNA that comes to us over billions of years. It is a journal of the stages of evolution that led up to squirrels, dogs, and every other living organism from a single common ancestral population billions of years ago.
More surprising is that the message is buried not in one set of DNA, but in the pattern of similarities and differences in DNA between different organisms. You won’t get this message from analyzing squirrel DNA alone, but it will start to surface as you compare the genes of a number of related organisms.
Powerful mathematical techniques can be applied to comparing the gene sequences of related organisms to decode their evolutionary history. Now with the advent of cheap and fast computing power, and cheap and fast gene sequencing, biologists all around the world are busy analyzing DNA and constructing the entire evolutionary tree of life using these techniques.
But how is it that mathematics can be used in biology to chart the history of an organism’s evolution? What kind of information would be in DNA that could be decoded with mathematics and how was that discovered?
In the latter part of the 20th century, as molecular genetics came into its own, biologists realized that the steps in the process of evolution fulfill the definition of what is called a Branching Markov Process. Markov processes are a class of processes that share a certain set of properties that were characterized very early in the 20th century by Russian mathematician Andrey Markov. While this realization might sound like something of interest only to mathematicians, the implications are far-reaching:
- If the diversity of all life on Earth has come about through evolution from a common ancestral population, the history of that billion year journey is recorded in the DNA of organisms alive today.
- The genetic message that comes to us across deep time is encoded in the distinctive and recognizable pattern produced by a Branching Markov Process.
- We should be able to confirm the theory of evolution for all life on Earth by finding the Markov message in the DNA comparisons between different living organisms.
- The message contains what we need to recreate the tree of life. We can recreate that tree branch by branch by sequencing the DNA of related organisms and comparing the sequences using Markov mathematics.
“Mathematics is the language with which God has written the universe.” – Galileo Galilei
Galileo would probably not be too surprised to hear that the process that gave rise to the entire diversity of life on Earth can be described mathematically. Nor would he be surprised to hear that the same mathematics can be used to decode the history of that process. The mathematics of Markov analysis are pretty fierce, but fortunately Markov processes are easy to describe using no mathematics at all. I promise that, and I promise that there will be no quiz on this on Friday.
A Drunkard’s Walk
First let me define a Markov Process as a process that changes the state of something in a small random way over and over again. The most popular example would be a drunk person walking across a field. The “state” of the walking drunkard at any given moment would be the position of the step he has just taken. As he staggers around the field, each new step is a new state. If you record the position of each step he takes, you will have a sequence of positions or states. Markov analysis would call the sequence of states a Markov Chain.
What makes this a Markov Process that produces a Markov Chain is that each new state is closely related to its previous state. The drunkard can change his position by no more than one stride length. If you number each state (or step, in this case) you would find that any state Sn is not much different from its predecessor Sn-1. The difference is equal to or less than the length of the drunkard’s stride.
Markov processes have a very short memory. They don’t look back very far in their history. If I told you where the drunkard is at this very moment, you couldn’t predict exactly where his next step will be, but you could predict that it will land somewhere within a circle whose radius is no longer than the length of one stride. Notice I did not have to tell you anything about his previous positions leading up to where he is now for you to predict the possibilities for his next step. So this drunkard’s walking process has “the Markov Property”, where each new state depends only on the current state.
If you happened to run into the drunkard just standing in the field, you might ask him where he came from. If he can’t manage to remember, you would be hard pressed to figure that out. All you know is that his previous step was somewhere within a circle whose radius is no larger than one stride.
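For readers who like to see such things run, here is a minimal sketch of the drunkard’s walk in Python. The stride length of 1.0, the starting point, and the function names are all assumptions made for illustration. The point to notice is that `next_step` is given only the current position and nothing about the earlier steps, which is the Markov Property in code form.

```python
import math
import random

STRIDE = 1.0  # assumed maximum stride length, in arbitrary units


def next_step(position, rng):
    """Return the next position; it depends only on the current one."""
    x, y = position
    angle = rng.uniform(0.0, 2.0 * math.pi)  # stagger in a random direction
    length = rng.uniform(0.0, STRIDE)        # never more than one stride
    return (x + length * math.cos(angle), y + length * math.sin(angle))


def drunkards_walk(n_steps, seed=0):
    """Build a Markov Chain: the list of positions S0, S1, ..., Sn."""
    rng = random.Random(seed)
    chain = [(0.0, 0.0)]  # assumed starting point
    for _ in range(n_steps):
        chain.append(next_step(chain[-1], rng))
    return chain


chain = drunkards_walk(1000)
# Every state lies within one stride of its predecessor.
largest_jump = max(math.dist(a, b) for a, b in zip(chain, chain[1:]))
print(f"steps: {len(chain) - 1}, largest single step: {largest_jump:.3f}")
```

The recorded list `chain` is the Markov Chain. Knowing only its last entry tells you where the next step can land, but nothing about the route taken to get there.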
So far it doesn’t seem like a Markov process is much help in figuring out the history of a process. But let’s take another example. This time the process is a chain letter, back in the days before there was email. A number of different people around the country are each instructed to sign the letter, make a copy, and mail the copy to another person who has not already received and signed the letter. Suppose the letter has no signing page, so the signers put their names anywhere on the letter they can find room.
Now suppose you are the last guy to get the letter. You get your copy in the mail and it has all the signatures except yours. Stepping back for a moment, notice that this is another Markov Process. The letter has gone through a series of states where each copy is the same as the one before it but has one more signature. But notice that this time the copy that arrives in the mail carries more information about its Markov Process than the drunkard did. If it has 23 signatures, not counting yours, you at least know it went through 23 previous states. You know this because this particular Markov process left some evidence. The small changes from state to state accumulated on the document. The only problem is that since the signatures are scattered all over the front page, you still don’t know what order the states occurred in. In fact, the signers might have all met at a high school reunion somewhere, signed it all at once, and then sent you a copy.
This is fun, but how does this relate to evolution, you might ask? Let’s suppose you caught a squirrel, took a sample of its DNA, and sequenced it. You now have the equivalent of that chain letter. The squirrel is the offspring of a set of parents who at conception made a copy of their DNA and passed it on to the first cell of the new squirrel. But since the copying process is imperfect, there is a chance that a slight mutation (or more than one) altered the DNA copy. The same thing happened when the squirrel’s parents were born from the grandparents, and so forth. The DNA has gone through billions of years of imperfect copying, where each copy varied slightly from the previous one by a small random mutation. And like the letter signing example, the results of each step along the way accumulated in the DNA in the form of mutations, generation after generation, all the way down to the squirrel you have in your laboratory cage. The implication so far is that evolution, among other things, is a Markov Process when it comes to what happens to DNA over the many generations it is handed down.
But we still have the same problem as the chain letter. We don’t know the order in which the mutations occurred. So how can we chart the evolutionary history of the squirrel? All we can see is the one final state, which has carried with it all the accumulated mutations like so many signatures. Fortunately for us and for evolutionary biologists, the chain letter analogy is not complete when it comes to the Markov Process of evolution. There is more that we can examine than the DNA of a single organism.
To get closer to what happens in evolution, let’s suppose a new scenario where people are mailing chain letters. But let’s modify the rules to produce a mega-chain letter. Let’s have each person sign the letter, make five copies, and send them to five different individuals who are supposed to do the same thing. Notice that this still describes a Markov Process, but rather than producing one Markov Chain it produces a large number of chains that are all different in a certain way. When Bob receives a letter he signs it and makes copies. But when he sends one copy to each of five different people he is establishing five branches off of his particular chain. Why? Because each of the five copies will now receive a different signature in its next step in the process. And then the same thing will happen again as those people sign, copy, and forward to five more people.
If you drew a diagram for this, each state would produce new branches that go off on their separate paths, further branching at each new state. The diagram would show the branches fanning out from one point, then each branch fanning out again, and so on, to form a tree-like structure. You could trace a path on the diagram from the originator of the letter all the way to any final recipient. Each such path would be its own Markov Chain, different from all the others. But the entire tree of chains would be a set of Branching Markov Chains.
Now if you picked up one of the letters from its final recipient, you would still have the same problem as with the single chain letter. You have a document with a lot of signatures, but no way to tell in which order they occurred. But something really interesting happens if you have all the final letters. The first thing you would notice is that no two of them have exactly the same set of signatures. But the next thing you would notice is that they all have at least one signature in common (the originator of the letter, let’s call him Bob). As you keep comparing them all, the next thing you might notice is that the letters can be sorted into five groups, where the letters in a particular group share two signatures, one being Bob and one being someone else. Let’s say the first group is the Bob-Mary group and the second is the Bob-Joe group and so on. These second shared signatures come from the first set of five recipients of Bob’s mailing. The next thing you might notice is that the Bob-Mary letters can be further broken down into five subgroups where the letters in each subgroup share another signature. That would be the Bob-Mary-Pete subgroup and the Bob-Mary-Sue subgroup and so on.
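The branching rule can be sketched in a few lines of Python. This is only an illustration: the names `P0`, `P1`, and so on are hypothetical stand-ins for the signers (with `P0` playing the originator), and I have assumed just two rounds of forwarding so the output stays small. Each letter is stored as a set, so the order of signing is deliberately lost, just as it is on a page signed wherever there was room.

```python
from itertools import count


def mail_letters(signatures, fanout, depth, ids):
    """Sign the letter, then forward `fanout` copies; return the final letters."""
    signer = f"P{next(ids)}"        # hypothetical unique name for this person
    signed = signatures | {signer}  # new state = previous state + one signature
    if depth == 0:
        return [signed]             # a final recipient signs and keeps the letter
    finals = []
    for _ in range(fanout):         # each copy starts its own branch
        finals.extend(mail_letters(signed, fanout, depth - 1, ids))
    return finals


# P0 is the originator; two rounds of forwarding to five people each.
letters = mail_letters(frozenset(), fanout=5, depth=2, ids=count())
shared_by_all = frozenset.intersection(*letters)
print(len(letters), "final letters,", len(set(letters)), "distinct")
print("signature found on every letter:", sorted(shared_by_all))
```

With these settings the sketch produces 25 final letters, no two alike, and the only signature common to all of them is the originator’s.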
By now you can see a pattern emerging. The similarities and differences in the signatures on the final letters contain information about the branching that a single letter does not have. And the successive grouping and dividing of the letters allow you to redraw the original tree that describes the order of the letter signing that originally took place, even if you did not already know how the letters were created.
If you start diagramming this, since the letters all have the Bob name, you plant the tree at Bob. Seeing that all the letters can be sorted into five groups by another name, you then can draw the Bob-Mary branch, then the Bob-Joe branch, and so on. Then because each of those groups can be further divided into its own five subgroups, you can draw those branches, and so on. Soon you will have drawn the entire tree that describes the original routing that all the letters took from one person to the next.
You might notice that this tree that you draw from the letter comparisons looks like a genealogy chart. The similarity is not a coincidence. If you substitute a human genome for the letter, and you realize that each child born inherits a copy of the parents’ genes but with a random variation, you can see that the building of an extended family in each generation is a Branching Markov Process. And the state that is being changed along each branch is the state of the genome as it is copied and handed off to the next generation, where it accumulates mutations just like the branching chain letter accumulated signatures.
The final results will produce relatives who can be grouped by their genetic differences and similarities just like the final chain letters could be grouped by similarities and differences in the signatures. And just like the chain letters, you can recreate the family tree diagram for those youngsters even without having the DNA of any of their parents, grandparents, or any of their ancestors. Now simply project that notion back millions of generations and you can see that the entire world of living organisms is related in the same way that people in an extended family are related. The reason why we know that is the same reason why we can tell siblings from cousins and second cousins and so forth in a family tree by comparing their DNA.
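Here is a rough Python sketch of that sorting procedure. It rests on two assumptions that hold in the chain letter story: every signer has a unique name, and every signer forwards to at least two people. To keep the data small I use two copies per person instead of five, and Ann and Tom are hypothetical names added to fill out the Bob-Joe group. The trick is that a signature from early in the process appears on more final letters than a later one, so counting how often each name appears recovers the order of signing.

```python
from collections import Counter


def rebuild_tree(letters):
    """Rebuild the routing tree from unordered sets of signatures."""
    # How many final letters carry each signature?
    counts = Counter(name for letter in letters for name in letter)
    tree = {}
    for letter in letters:
        # Earlier signers appear on more letters, so sorting by count
        # (most common first) recovers the order this letter was signed in.
        path = sorted(letter, key=lambda name: (-counts[name], name))
        node = tree
        for name in path:
            node = node.setdefault(name, {})  # walk down, adding branches
    return tree


def show(tree, indent=0):
    """Print the tree with one signer per line, indented by depth."""
    for name in sorted(tree):
        print(" " * indent + name)
        show(tree[name], indent + 2)


# Final letters only; sets carry no information about signing order.
letters = [
    {"Pete", "Bob", "Mary"},
    {"Mary", "Sue", "Bob"},
    {"Ann", "Joe", "Bob"},
    {"Bob", "Tom", "Joe"},
]
tree = rebuild_tree(letters)
show(tree)
```

The printed tree has Bob at the root, Joe and Mary as the two branches, and the four final recipients at the tips, even though no letter recorded who signed first.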
Evolution is a Branching Markov Process and it accumulates information across the genomes of related organisms in the precise pattern that we would expect from that kind of process. Encoded in that pattern is evolution’s billion year journal of where it has been and what it has accomplished.
If that seemed too easy, you are correct. Sorting piles of messages doesn’t seem very math intensive. But consider that the genome has billions of base pairs that could be mutated in any which way. Sorting them by hand would take until the end of time. Fortunately, powerful computer techniques can be used to do the analysis.
To give you a feel for what those techniques have to deal with, let me extend my last branching chain letter problem by one more challenging angle. Suppose your job at the FBI was to establish the alibi of many of the participants in the chain letter. But to make the problem more challenging, suppose none of the chain letter participants signed any of the letters.
You go back to your office and ponder this for a while, and then while staring at the letters spread out on your desk you realize that some of them have other artifacts on them. It looks like some of the letters lay around for a while on people’s desks before they were copied and mailed out. One group has a coffee mug stain, another has the remnants of a swatted fly, another group has a phone number written in the corner, another has doodles, and so forth. While looking for other artifacts, you notice that some of the copiers had dirty glass, causing some letters to have little smudges and nits that the copiers picked up.
This is important because each letter picked up unique artifacts at the location where it was copied, and those artifacts were then reproduced on all five copies, just as a signature would have been. That means you might be able to do the same analysis using the artifacts alone. In fact there might be further artifacts due to the unique lens distortions in the copiers at each location as well. That means you might be able to do the analysis using the distortion from each of the lenses and the artifacts found on the letters. While they are good stand-ins for the missing signatures, they are distributed all over each of the letters.
So now you decide to pull out your heavy hitter Markov Analysis software. You scan each letter with a very high resolution scanner, which turns the letter into a grid of pixel values like any digital image. Then you feed them into the bigtime Markov software, which compares all the letters on the basis of the pixel values. That way it can include the effect of all the artifacts from lying around each office and from the lens distortions of each copier. The software crunches for a half hour or so and then comes back with a tree, for the same reason that your manual signature sort produced a tree.
Just to be sure you instruct the software to redo the analysis a number of times using different specific areas on the letters. For example, you have it analyze all the letters but only using the top left hand quarter of each letter, then repeat that for the top right hand quarter, and so on. Each analysis produces its own tree that it coaxed out of its particular area of the letters.
If you can see why each of the trees it produces should be identical (or very close) then you are starting to realize the implications of how one could prove almost beyond a shadow of a doubt that the letters were indeed produced by a branching chain letter process (which is a branching Markov Process). It would be very difficult if not impossible to fake all those letters in a way that could possibly produce the same results unless the information accumulated through the branching chain letter process.
The unsigned branching chain letter analogy is very close to what happens when DNA from different related species is sequenced and analyzed. The genome of each organism is like one of the final letters in the analogy, and the billions of base pairs that make up the information in the genome are like the pixels from the images of the letters. Comparing the whole genomes of a group of different but related organisms should give us a tree, and so should comparing any particular section of the genome between those organisms. We should end up with a set of trees that agree with each other if the information in those genomes has accumulated through the Branching Markov Process of evolution.
However, since this message from the past comes to us over millions of years, and natural selection is carefully preserving some genes over others (along with other interference), it is not surprising that the message is somewhat noisy. Also consider that unlike our nice clean chain letter scenario, we don’t have all the final results. Most of the species in that huge tree of related organisms going back millions of years are now extinct. So in actual practice the trees that biologists get from the different sections of the genome do not all agree one hundred percent.
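A toy version of this section-by-section comparison can be sketched in Python. Everything here is an assumption made for illustration: the genome length of 4000 bases, the 100 mutations per copying step, the four made-up organisms A, B, C and D, and above all the crude method of simply counting the sites where two sequences differ, which stands in for the far more sophisticated Markov models that real software uses. A and B are copied from one intermediate ancestor and C and D from another, and the sketch asks each half of the genome separately who is most closely related to whom.

```python
import random

BASES = "ACGT"


def copy_with_mutations(seq, n_mutations, rng):
    """Imperfect copy: change n_mutations randomly chosen sites."""
    copy = list(seq)
    for site in rng.sample(range(len(copy)), n_mutations):
        copy[site] = rng.choice([b for b in BASES if b != copy[site]])
    return "".join(copy)


def distance(a, b):
    """Number of sites at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_relative(genomes, section):
    """For each genome, find its nearest neighbour using one section only."""
    result = {}
    for name, seq in genomes.items():
        others = [n for n in genomes if n != name]
        result[name] = min(
            others, key=lambda n: distance(seq[section], genomes[n][section])
        )
    return result


rng = random.Random(42)
ancestor = "".join(rng.choice(BASES) for _ in range(4000))
left = copy_with_mutations(ancestor, 100, rng)   # ancestor of A and B
right = copy_with_mutations(ancestor, 100, rng)  # ancestor of C and D
genomes = {
    "A": copy_with_mutations(left, 100, rng),
    "B": copy_with_mutations(left, 100, rng),
    "C": copy_with_mutations(right, 100, rng),
    "D": copy_with_mutations(right, 100, rng),
}

# Analyse each half of the genome on its own, like quarters of a letter.
first_half = closest_relative(genomes, slice(0, 2000))
second_half = closest_relative(genomes, slice(2000, 4000))
print("first half: ", first_half)
print("second half:", second_half)
```

Both halves should pair A with B and C with D, even though neither half shares a single site with the other and neither ancestor is available for inspection.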
If the trees don’t agree one hundred percent, how much confidence should we have in the results? Going back to the branching chain letter analogy, suppose we don’t have all the final letters. And suppose the letters we do have were all copied one last time on a really bad copier. Now suppose the section by section analysis of the incomplete set of noisy letters produced trees that agreed with each other by eighty percent or so. How confident would you be that the letters were originally created by the branching chain letter process? To get a better feel for that, ask yourself what other kind of process could produce a set of letters that produce any trees at all, or trees from different sections that agree with each other in any way. If not for being generated through a branching chain letter process, you should only get nonsense from the analysis.
The same goes for DNA. Agreement of better than eighty percent between trees from genetic comparisons is almost miraculous considering that some of the information we are using is billions of years old. As with the noisy letter analogy, if the genes in a squirrel did not come about through this Branching Markov process, comparing them to other species in the rodent family using Markov analysis would just produce nonsense. This is because the number of different combinations for mutations by any other process would be astronomical. So it would be a cosmic coincidence to get trees that agree so well. And it would be a million cosmic coincidences if the Branching Markov Process of evolution did not produce all the diversity of life on Earth, yet we could get good agreement in the sets of trees we obtain from the comparisons of the hundreds of thousands of organisms we have analyzed so far.
One final slam dunk for evolution. When we classify organisms the old school way by comparing their anatomy and behavior (as we have been doing for about 200 years now), the evolutionary tree we build from those classifications also agrees with the trees we get from the mathematical gene sequence comparisons.
Using only evidence from living organisms, we can read and decode evolution’s billion year journal that comes down to us through deep time. The medium for that message is the DNA of those organisms.
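To put a rough number on “cosmic coincidence”, it helps to count how many different trees are possible in the first place. The sketch below uses a standard counting result that is not part of the argument above: for n labelled organisms there are (2n-3)!! distinct rooted trees in which every branch point splits in two, that is, 3 x 5 x 7 x ... x (2n-3). Note that it counts possible trees rather than possible combinations of mutations, so treat it as an illustration of scale only.

```python
def rooted_tree_count(n_leaves):
    """Number of rooted, two-way-branching trees on n labelled leaves: (2n-3)!!"""
    total = 1
    for k in range(3, 2 * n_leaves - 2, 2):  # 3, 5, ..., 2n-3
        total *= k
    return total


for n in (4, 10, 20):
    print(f"{n} organisms: {rooted_tree_count(n):,} possible trees")
```

Four organisms can be arranged in only 15 such trees, but ten organisms already allow 34,459,425. If the trees from different sections of the genome were unrelated to each other, the odds of them landing on the same or nearly the same tree by luck would shrink accordingly.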