{"id":663,"date":"2017-02-19T10:58:30","date_gmt":"2017-02-19T15:58:30","guid":{"rendered":"http:\/\/theappleandthefinch.com\/?p=663"},"modified":"2017-03-10T20:43:04","modified_gmt":"2017-03-11T01:43:04","slug":"decoding-evolution","status":"publish","type":"post","link":"http:\/\/theappleandthefinch.com\/2017\/02\/19\/decoding-evolution\/","title":{"rendered":"Decoding Evolution"},"content":{"rendered":"<p>Unless you were raised by wolves, you will not be surprised to hear that the DNA of different organisms contains a kind of recipe for developing and maintaining that organism over its lifetime. \u00a0Squirrel DNA has a recipe for squirrels, dog DNA for dogs, and so forth.<\/p>\n<p>But wolves notwithstanding, you might be surprised to hear that there is another extremely ancient message in DNA that comes to us over billions of years. \u00a0It is a journal of the stages of evolution that led up to squirrels, dogs, and every other living organism from a single common ancestral population billions of years ago.<\/p>\n<p>More surprising is that the message is buried not in one set of DNA, but in the pattern of similarities and differences in DNA between different organisms. \u00a0You won&#8217;t get this message from analyzing squirrel DNA alone, but it will start to surface as you compare the genes of a number of related organisms.<\/p>\n<p>Powerful mathematical techniques can be applied to\u00a0comparing the gene sequences of related organisms to decode their evolutionary history. Now, with the advent of cheap and fast computing power and cheap and fast gene sequencing, biologists all around the world are busy analyzing DNA and constructing the entire evolutionary tree of life using these techniques.<\/p>\n<p>But how is it that mathematics can be used in biology to chart the history of an organism&#8217;s evolution? 
What kind of information would be in DNA that could be decoded with mathematics, and how was that discovered?<\/p>\n<p>In the latter part of the 20th century, as molecular genetics came into its own, biologists realized that the steps in the process of evolution fulfill the definition of what is called a Branching Markov Process. \u00a0Markov processes are a class of processes that share a certain set of properties that were characterized very early in the 20th century by Russian mathematician <a href=\"https:\/\/en.wikipedia.org\/wiki\/Andrey_Markov\" target=\"_blank\">Andrey Markov<\/a>. \u00a0While this realization might sound like something of interest only to mathematicians, the implications\u00a0are:<\/p>\n<ul>\n<li>If\u00a0the diversity of all life on Earth has come about through evolution from a common ancestral population, the history of that\u00a0billion-year journey is recorded in the DNA of organisms alive today.<\/li>\n<li>The\u00a0genetic message that comes to us across deep time\u00a0is encoded in the distinctive and recognizable pattern produced by a Branching Markov Process.<\/li>\n<li>We should be able to confirm the theory of evolution for all life on Earth by finding\u00a0the Markov message in the DNA comparisons between different living organisms.<\/li>\n<li>The message contains what we need to recreate the tree of life. \u00a0We can recreate that tree branch by branch by sequencing\u00a0the DNA of related organisms and comparing the sequences using Markov mathematics.<\/li>\n<\/ul>\n<blockquote><p>&#8220;Mathematics is the language with which God has written the universe.&#8221; &#8211; Galileo Galilei<\/p><\/blockquote>\n<p>Galileo would probably not be too surprised to hear that the process that gave rise to the entire diversity of life on Earth can be described mathematically. \u00a0Nor would he be surprised to hear that the same mathematics can be used to decode the history of that process. 
\u00a0The mathematics of Markov analysis is pretty fierce, but fortunately Markov processes are easy to describe using no mathematics at all. \u00a0I promise that, and I promise that there will be no quiz on this on Friday.<\/p>\n<h2>A Drunkard&#8217;s Walk<\/h2>\n<p>First let me define a Markov Process as a process that changes the state of something in a small random way over and over again. \u00a0The most popular example would be a drunk person walking across a field. \u00a0The &#8220;state&#8221; of the walking drunkard at any given moment would be the position of the step he has just taken. As he staggers around the field, each new step is a new state. \u00a0If you record the position of each step he takes, you will have a sequence of positions or states. Markov analysis would call the sequence of states a Markov Chain.<\/p>\n<p>What makes this a Markov Process that produces a Markov Chain is that each new state is closely related to its previous state. The drunkard can change his position by no more than one stride length. If you number each state (or step, in this case) you would find that any state Sn is not much different from its predecessor Sn-1. \u00a0The difference is equal to or less than the length of the drunkard&#8217;s stride.<\/p>\n<p>Markov processes have a very short memory. \u00a0They don&#8217;t look back very far in their history. \u00a0If I told you the location the drunkard was in at this very moment, you couldn&#8217;t predict where his next step will be, but you could\u00a0predict that it\u00a0will land somewhere within a circle whose radius is no longer than the length of one stride. \u00a0Notice I did not have to tell you anything\u00a0about his previous positions leading up to where he is now for you to predict\u00a0the possibilities for his next step. 
\u00a0So this drunkard&#8217;s walking process has &#8220;the Markov Property&#8221;, where each new state depends only on the current\u00a0state.<\/p>\n<p>If you happened to run into the drunkard just standing in the field you might ask him where he came from. \u00a0If he can&#8217;t manage to remember, you would be hard pressed to figure that out. All you know is that his previous step lies within a circle whose radius is no larger than the length of one stride.<\/p>\n<p>So far it doesn&#8217;t seem like a Markov process is much help in figuring out the history of a process. But let&#8217;s take another example. \u00a0This time the process is that a number of different people around the country are going to pass along a chain letter, back\u00a0before there was email. \u00a0They are instructed to each sign the letter, make a copy, and mail\u00a0the copy to another person\u00a0who has not already received and signed the letter. \u00a0Suppose the letter\u00a0has no signing page and no room set aside for signatures, so the signers put their names anywhere on the letter\u00a0they can find room.<\/p>\n<p>Now suppose you are the last guy to get the letter. \u00a0You get your copy in the mail and it has all the signatures except yours. Stepping back for a moment, notice that this is another Markov Process. \u00a0The letter has\u00a0gone through a series of states where each copy is the same as the one before it but has one more signature. But notice that this time the copy arrives in the mail with more information about its Markov Process than we had in the drunkard case. If it has 23\u00a0signatures, not counting yours, you at least know it went through 23 previous states. \u00a0You know this because this particular Markov process left some evidence. The small changes from state to state accumulated on the document. The only problem is that since the signatures are scattered all over the front page, you still don&#8217;t know what order the states occurred in. 
\u00a0In fact, the signers might have all met\u00a0at a high school reunion\u00a0somewhere and all signed it at once and then sent you a copy.<\/p>\n<p>This is fun, but how does this relate to evolution, you might ask? \u00a0Let&#8217;s suppose you caught a squirrel, took a sample of its DNA, and sequenced it. \u00a0You now have the equivalent of that chain letter. \u00a0The squirrel is the offspring of a set of parents who at conception made a copy of their DNA and passed it on to the first cell of the new squirrel. But since the copying process is imperfect, there is a chance that a slight mutation (or more than one) altered the DNA copy. \u00a0The same thing happened when the squirrel&#8217;s parents were born from the grandparents, and so forth. \u00a0The DNA has gone through a couple of billion years of imperfect copying\u00a0where each copy varied slightly from the previous one by a small random mutation. And like the letter\u00a0signing example, the results of each step along the way accumulated in the DNA in the form of mutations, generation after generation, all the way down to the squirrel you have in your\u00a0laboratory cage. I think the implication is clear so far: the process of evolution, among other things, is a Markov Process when it comes to what happens to DNA over the many generations through which it is handed down.<\/p>\n<p>But we still have the same problem as with the chain letter. \u00a0We don&#8217;t know the order in which the mutations occurred. \u00a0So how can we chart the evolutionary history of the squirrel? All we can see is the one final state, which has carried with it all the accumulated mutations like so many signatures. Fortunately for us and for evolutionary biologists, the analogy of the single chain letter is not complete\u00a0when it comes to the Markov Process of evolution. 
\u00a0There is more that we can examine than the DNA of a single organism.<\/p>\n<p>To get closer to what happens in evolution, let&#8217;s suppose a new scenario where people are mailing chain letters. \u00a0But let&#8217;s modify the rules to produce a mega-chain letter. \u00a0Let&#8217;s have each person sign the letter but make five copies and distribute them to five different individuals who are supposed to do the same thing. \u00a0Notice that this still describes a Markov Process, but rather than producing one Markov Chain it produces a large number of chains that all differ from one another in a certain way. \u00a0When Bob receives a letter he signs it and makes copies. But when he sends the copies to five different people he is establishing five branches off of\u00a0his particular chain. \u00a0Why? \u00a0Because each of the five copies will now receive a different signature in its next step in the process. And then the same thing will happen again as those people sign, copy, and forward to five more people.<\/p>\n<p>If you drew a diagram of this, each state would produce new branches that go off on their separate paths, further branching at each new state. \u00a0The diagram would show the branches fanning out from one point, then each branch fanning out again, and so on, to form a tree-like structure.\u00a0You could trace a path on the diagram from the originator of the letter all the way to any\u00a0final recipient. \u00a0That one path would be a Markov Chain, and the path to each other recipient would be a different Markov Chain. \u00a0But the entire tree of chains would be a set of Branching Markov Chains.<\/p>\n<p>Now if you picked up one of the letters from its final recipient, you would still have the same problem as with the single chain letter. \u00a0You have a document with a lot of signatures, but no way to tell in which order they occurred. But something really interesting happens if you have all the final letters. 
The first thing you would notice is that no two of them have exactly the same set of signatures. \u00a0But the next thing you would notice is that they all have at least one signature in common (the originator of the letter, let&#8217;s call him Bob). As you keep comparing them all, the next thing you might notice is that the letters can be sorted into five groups, where the letters in a particular group share two signatures, one being Bob&#8217;s and one being someone else&#8217;s. \u00a0Let&#8217;s say the first group is the Bob-Mary group and the second is the Bob-Joe group and so on. \u00a0These second shared signatures come from the five recipients of Bob&#8217;s mailing. The next thing you might notice is that the Bob-Mary letters can be further broken down into five subgroups where the letters in each subgroup share another signature. \u00a0That would be the Bob-Mary-Pete subgroup and the Bob-Mary-Sue subgroup and so on.<\/p>\n<p>By now you can see a pattern emerging. \u00a0The similarities and differences in the signatures on the final letters contain\u00a0information about the branching that a single letter does not have. And the successive grouping and dividing of the letters allows you to redraw the original tree that describes the order of the letter signing\u00a0that originally took place, even if you did not already know how the letters were created.<\/p>\n<p>If you start diagramming this, since the letters all have Bob&#8217;s name, you plant the tree at Bob. \u00a0Seeing that all the letters can be sorted into five groups by a second name, you then can draw the Bob-Mary branch, then the Bob-Joe branch, and so on. \u00a0Then because each of those groups can be further divided into its own five subgroups you can draw those branches and so on. 
\u00a0Soon you will have drawn the entire tree that describes the original routing that all the letters took from one person to the next.<\/p>\n<p>You might notice that this tree that you draw from the letter comparisons looks like a genealogy chart. \u00a0The similarity is not a coincidence. If you substitute a human genome for the letter, and you realize that each child born inherits a copy of the parents&#8217; genes but with a random variation, you can see that the building of an extended family in each generation is a Branching Markov Process. \u00a0And the state that changes along each branch is the state of the genome as it is copied and handed off to the next generation, where it accumulates mutations just like the branching chain letter accumulated signatures.<\/p>\n<p>The final results will produce relatives who can be grouped by their genetic differences and similarities just like the final chain letters could be grouped by similarities and differences in the signatures. \u00a0And just like the chain letters, you can recreate the family tree diagram for those youngsters even without having the DNA of any of their parents, grandparents, or any of their ancestors. \u00a0Now simply project that notion back millions of generations and you can see that the entire world of living organisms is related in the same way that people in an extended family are related. \u00a0The reason why we know that is the same reason why we can tell siblings from cousins and second cousins and so forth in a family tree by comparing their DNA.<\/p>\n<p>Evolution is a Branching Markov Process and it accumulates information across the genomes of related organisms in the precise pattern that we would expect from that kind of process.\u00a0Encoded in that pattern is evolution&#8217;s billion-year journal of where it has been and what it has accomplished.<\/p>\n<p>If that seemed too easy, you are correct. 
\u00a0Sorting piles of messages doesn&#8217;t seem very math intensive. \u00a0But consider that the genome has billions of base pairs that could be mutated in any which way. \u00a0Sorting them by hand would take until the end of time. Fortunately, powerful computer techniques can be used to do the analysis.<\/p>\n<p>To give you a feel for what those techniques have to deal with, let me extend my last branching chain letter problem by one more challenging angle. Suppose your job at the FBI was to establish the alibi of many of the participants in the chain letter. \u00a0But to make the problem more challenging, suppose none of the chain letter participants signed any of the letters.<\/p>\n<p>You go back to your office and ponder this for a while and then, while staring at the letters spread out on your desk, you realize that some of them have other artifacts on them. \u00a0It looks like some of the letters in the process lay around for a while on people&#8217;s desks before they were copied and mailed out. One group has a coffee mug stain, another has the remnants of a swatted fly, another group has a phone number written in the corner, another has doodles, and so forth. \u00a0While looking for other artifacts, you notice that some of the copiers had dirty glass,\u00a0causing some\u00a0letters to have little smudges and nits that the copiers picked up.<\/p>\n<p>This is important because each letter picked up unique artifacts at the location where it was copied, and each artifact was copied five times just as a signature would have been. \u00a0That means you might be able to do the same analysis using the artifacts alone.\u00a0In fact\u00a0there might be further artifacts due to the unique lens distortions in the copiers at each location as well. \u00a0That means you might be able to do the analysis using the distortion from each of the lenses and the artifacts found on the letters. 
\u00a0While they are good stand-ins for the missing signatures, they are distributed all over each of the letters.<\/p>\n<p>So now you decide to pull out your heavy hitter Markov Analysis software. \u00a0You scan each letter with a very high resolution scanner which turns the letter into a grid of pixel values like any digital image. Then you feed the scans into the bigtime Markov software, which works out the similarities and differences between all the letters on the basis of the pixel values. \u00a0That way it can include the effect of all the artifacts from lying around each office and from the lens distortions of each copier. The software crunches for a half hour or so and then comes back with\u00a0a tree for the same reason that your manual signature sort produced a tree.<\/p>\n<p>Just to be sure, you instruct the software to redo the analysis a number of times using different specific areas on the letters. \u00a0For example, you have it analyze all the letters but only using the top left hand quarter of each letter, then repeat that for the top right hand quarter, and so on. Each analysis produces its own tree that it coaxed out of its particular area of the letters.<\/p>\n<p>If you can see why each of the trees it produces should be identical (or very close) then you are starting to realize the implications of how one could prove almost beyond a shadow of a doubt that the letters were indeed produced by a branching chain letter process (which is a Branching Markov Process). \u00a0It would be very difficult if not impossible to fake all those letters in a way that could possibly produce the same results unless the information accumulated through the branching chain letter process.<\/p>\n<p>The unsigned branching chain letter analogy is very close to what happens when DNA from different related species is sequenced and analyzed. 
\u00a0The genome of each organism is like one of the final letters in the analogy, where the billions of base pairs that make up the information in the genome are like the pixels from the images of the letters. Comparing the\u00a0whole genomes of a group of different but related organisms should give us a tree, and so should comparing any particular section of the genome between those organisms. \u00a0We should end up with a set of trees that agree with each other if the information in those genomes has accumulated through the Branching Markov Process of evolution.<\/p>\n<p>However, since this message from the past comes to us over millions of years, and natural selection is carefully preserving some genes over others\u00a0(along with other interference), it is not surprising that the message is somewhat noisy. Also consider that unlike our nice clean chain letter scenario, we don&#8217;t have all the final results. \u00a0Most of the species in that huge tree of related organisms going back millions of years are now extinct. \u00a0So in actual practice the trees that biologists get from the different sections of the genome do not all agree one hundred percent.<\/p>\n<p>If the trees don&#8217;t agree one hundred percent, how much confidence should we have in the results? \u00a0Going back to the branching chain letter analogy, suppose we don&#8217;t have all the final letters. \u00a0And suppose the letters we do have were all copied one last time on a really bad copier. \u00a0Now suppose the section by section analysis of the incomplete set of noisy letters produced trees that agreed with each other by eighty percent or so. How confident would you be that the letters were originally created by the branching chain letter process?\u00a0To get a better feel for that, ask yourself what other kind of process could produce a set of letters that yield any trees at all, or trees from different sections that agree with each other in any way. 
\u00a0If the letters had not been generated through a branching chain letter process, you should get only nonsense from the analysis.<\/p>\n<p>The same goes for DNA. \u00a0Agreement of greater than eighty percent among trees from genetic comparisons\u00a0is almost miraculous considering that some of the information we are using is billions of years old. \u00a0As with the noisy letter analogy,\u00a0if the genes in a squirrel did not come about through this Branching Markov Process, comparing them to those of other rodent species using Markov analysis would just produce nonsense. \u00a0This is because the number of different combinations of mutations possible under any other process would be astronomical. \u00a0So it would be a cosmic coincidence to get trees that agree so well. \u00a0And it would be a million cosmic coincidences if the Branching Markov Process of evolution did not produce all the diversity of life on Earth, yet we still got good agreement in the sets of trees we obtain from the comparisons of the hundreds of thousands of organisms we have analyzed so far.<\/p>\n<p>One final slam dunk for evolution. When we classify organisms the old school way by comparing their anatomy and behavior (as we have been doing for about 200 years now), the evolutionary tree we build from those classifications also agrees with the trees we get from the mathematical gene sequence comparisons.<\/p>\n<p>Using only evidence from living organisms, we can read and decode evolution&#8217;s billion-year journal that comes down to us through deep time. \u00a0The medium for that message is the DNA of those organisms.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Unless you were raised by wolves, you will not be surprised to hear that the DNA of different organisms contains a kind of recipe for developing and maintaining that organism over its lifetime. 
\u00a0Squirrel DNA has a recipe for squirrels, dog DNA for dogs and so forth. But wolves notwithstanding, you might be surprised [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/posts\/663"}],"collection":[{"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/comments?post=663"}],"version-history":[{"count":2,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/posts\/663\/revisions"}],"predecessor-version":[{"id":787,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/posts\/663\/revisions\/787"}],"wp:attachment":[{"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/media?parent=663"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/categories?post=663"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/theappleandthefinch.com\/wp-json\/wp\/v2\/tags?post=663"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}