01 Apr Mathematics / Statistics
Answer 5 questions in the attachment below in DETAILS
APA7 FORMAT, INTEXT CITATION, REFERENCES INCLUDED
Use the NOTE attachment for question 4
Q1. Why is the Fourier Transform particularly suitable for DNA data?
Q2. Describe in detail what GWAS is used for and how it works. Be sure to include an example.
Q3. Define Shannon Entropy. Provide an example using a short 10mer of DNA.
Q4. What are the four statistical approaches for phylogenetic reconstruction we discussed in class? Which of them attempt to recapitulate actual evolutionary events? Which would you use to determine the phylogeny of 35 species of rotifer?
Q5. What are the three types of nodes in an artificial neural network (ANN, sometimes referred to as ML)? What does each of them do?
Humans, it would seem, have a great love of categorizing, organizing, and pigeonholing things. This love affair extends to lifeforms, of course – we have been attempting to group and name plants, animals, and insects as far back as 1500 BC [1]. By studying the relationships of things, we can better understand behaviors and characteristics important to agriculture, medicine, animal husbandry – and of course, evolution itself.
[1] Manktelow, M. (2010). History of Taxonomy.
From your basic biology classes, you should remember that the act of classifying organisms is called taxonomy. The science that studies how those organisms evolved – and are related to one another – is called phylogeny.
In the early days of the scientific method, organisms were compared by their morphology – their physical structure and characteristics. While this works to a certain extent (and it was all we had to go on before we had DNA sequencing techniques), it caused some honestly hilarious pairings. For example, there's a ruminant primate (monkeys and cows are not in fact directly related) – and if you compare the morphology of an octopus' eye to that of humans, you can see that they must be closely related!
With the advent of DNA sequencing, scientists were able to go directly "to the source" for information on evolutionary history (phylogeny). Thanks to molecules like the small ribosomal subunit RNA (16S in prokaryotes and 18S in eukaryotes), we have excellent unique identifiers for species. You'll learn more about the molecular biology of how this works in other courses; for the purposes of this class we are more interested in how that sequence data is used to reconstruct the evolutionary history of species.
The Data
To reconstruct phylogeny and create a phylogenetic tree, we start with a Multiple Sequence Alignment (MSA). Illustrated below is a small section of an alignment of the 18S gene from several species:
You can see substitutions as well as indels in this small sample. This information can then be used to both identify and group the species taxonomically in a variety of ways. Let's take a look at four of the most common methods of creating phylogenetic trees – Distance, Maximum Parsimony, Maximum Likelihood, and Bayesian.
DISTANCE
One of the simplest and oldest methods, the distance approach is still used today. It works by simply computing a distance for each possible pairing of sequences and collecting those distances in a matrix. For example, given the following three sequences:
S1 aactc
S2 aagtc
S3 tagtt
We can count the substitutions between each pair and generate a matrix:
     S1   S2   S3
S1   –    1    3
S2   1    –    2
S3   3    2    –
Notice that this forms two "triangles", where the upper triangle is the mirror of the lower (e.g., S1 vs S2 is shown in two places, and it's the same value). Also note that comparisons of a sequence with itself (S3 vs S3) are just a "dash".
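The counting step can be sketched in a few lines of Python. This is only an illustration using the three sequences above; the function names are made up for this example, and it assumes the sequences are already aligned to the same length and contain no gap characters.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict) -> dict:
    """Return pairwise substitution counts keyed by (name, name)."""
    dist = {}
    for a, b in combinations(seqs, 2):
        d = hamming(seqs[a], seqs[b])
        # Store both orders: the upper triangle mirrors the lower one.
        dist[(a, b)] = dist[(b, a)] = d
    return dist


seqs = {"S1": "aactc", "S2": "aagtc", "S3": "tagtt"}
dist = distance_matrix(seqs)
print(dist[("S1", "S2")], dist[("S1", "S3")], dist[("S2", "S3")])  # 1 3 2
```

Storing each distance under both key orders keeps lookups simple at the cost of a little memory, which is fine for a teaching example.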
This is the simplest possible form of distance matrix calculation. From this, we can actually start drawing a phylogenetic tree – for example, S1 and S2 are closer to each other than they are to S3, but S3 is closer to S2 than it is to S1, so we could come up with this tree topology:
This is a "rooted" tree drawn with proportional branch lengths – meaning the distances correspond to the length of the lines. S3 is closer to S2 than S1, S2 is closer to S1 than S3!
As I mentioned above, this is a very old and simple approach. It is, however, still used today, primarily because the calculations are very easy and fast, which means that you can easily use it to compute phylogenetic trees for large numbers of species – something difficult to do with the other methods we'll talk about.
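To give a feel for why distance methods are fast, here is a minimal sketch of UPGMA, one common distance-based clustering algorithm. UPGMA is an assumption on my part (these notes do not name a specific algorithm), and the tree it produces forces all tips to be equally far from the root, so it will not look exactly like the tree drawn above. The function name and data layout are made up for this example.

```python
def upgma(names, dist):
    """Cluster taxa by average linkage.

    names: list of taxon labels
    dist:  dict mapping (label_a, label_b) -> distance for every pair
    Returns (tree as nested tuples, height of the root).
    """
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> leaf count
    d = {(i, j): float(dist[(names[i], names[j])])
         for i in trees for j in trees if i < j}
    next_id = len(names)
    height = 0.0
    while len(trees) > 1:
        # Pick the closest pair of clusters (ties broken by cluster id).
        (i, j), dij = min(d.items(), key=lambda kv: (kv[1], kv[0]))
        height = dij / 2
        # Size-weighted average distance from the merged cluster to the rest.
        new_d = {}
        for k in trees:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (sizes[i] * dik + sizes[j] * djk) / (sizes[i] + sizes[j])
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        trees[next_id] = (trees[i], trees[j])
        sizes[next_id] = sizes[i] + sizes[j]
        for old in (i, j):
            del trees[old], sizes[old]
        next_id += 1
    (tree,) = trees.values()
    return tree, height


pairwise = {("S1", "S2"): 1, ("S1", "S3"): 3, ("S2", "S3"): 2}
pairwise.update({(b, a): v for (a, b), v in list(pairwise.items())})
tree, root_height = upgma(["S1", "S2", "S3"], pairwise)
print(tree, root_height)  # ('S3', ('S1', 'S2')) 1.25
```

Each round only scans and updates the distance table, which is why this family of methods scales to large numbers of taxa.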
The problem with the distance approach is that it is very simplistic – it doesn't take into account any sort of evolutionary model of change, and it assumes that all mutations are equally likely. The first problem (the evolutionary model) cannot be addressed by distance methods – but we can tweak the distance method by applying a Mutation Model to provide information with regards to mutation.
Mutational Models
There are several models of mutation that can be added to the distance method. The simple method above, where all mutations are assumed to be equally likely, is called the Jukes-Cantor method. The most popular model is the Kimura 2-parameter model, which assigns different values for transitions (α) and transversions (β):
This looks like a Markov model, doesn't it? That's because it is – a simple, 2 parameter Markov model for evolution that is used to weight the calculations when generating the distance matrix from MSA.
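As a rough sketch of how these models change the numbers in the distance matrix, the standard Jukes-Cantor and Kimura 2-parameter distance corrections can be written as below. The function names are made up for this example; it assumes lowercase, gap-free, aligned DNA, and P and Q are the observed proportions of transitions and transversions.

```python
import math

PURINES = {"a", "g"}  # the other two bases, c and t, are pyrimidines


def jc69_distance(a: str, b: str) -> float:
    """Jukes-Cantor corrected distance: every substitution equally likely."""
    p = sum(x != y for x, y in zip(a, b)) / len(a)
    # Undefined when p >= 0.75 (the sequences look saturated).
    return -0.75 * math.log(1 - 4 * p / 3)


def k2p_distance(a: str, b: str) -> float:
    """Kimura 2-parameter distance: transitions and transversions differ."""
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    P = transitions / len(a)
    Q = transversions / len(a)
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


print(round(jc69_distance("aactc", "aagtc"), 4))
print(round(k2p_distance("aactc", "aagtc"), 4))
```

Both corrected distances come out larger than the raw proportion of differences (0.2 for S1 vs S2) because the models allow for multiple substitutions at the same site.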
It is important to note that substitutions are the only element in the MSA that distance phylogeny takes into account – indels are disregarded. Yet another reason why the distance method is "simple" – and ultimately less accurate at recreating the actual evolutionary paths. Let's move on to a method that does attempt to recreate the actual evolutionary history of the species (more commonly referred to as "taxa") in question.
MAXIMUM PARSIMONY
Parsimony is defined as "the scientific principle that things are usually connected or behave in the simplest or most economical way, especially with reference to alternative evolutionary pathways." Maximum parsimony, then, means maximizing that simplicity. What parsimony algorithms are designed to do is to recreate the actual evolutionary history of the organisms being analyzed with relation to each other in a fashion that minimizes the number of steps required to traverse the entire tree – meaning minimizing the number of evolutionary changes.
The information that parsimony algorithms use to infer the evolutionary history comes from informative sites. These are columns in the alignment that have at least two different characters (e.g., A as well as C), each of which has to appear in at least two sequences. They are called informative because by having that similarity to at least one other sequence, they help inform the process of inferring the ancestral states at the nodes of the tree. You should recall that the tips of a phylogenetic tree are the currently extant taxa; the root is the common ancestor, and the middle nodes represent the species that existed at one time but are now extinct. These ancestral node sequence states are inferred using the informative sites.
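Here is a small sketch of both ideas: finding informative sites, and counting the minimum number of changes a given tree needs. The change counting uses the Fitch algorithm, which is my choice for illustration rather than something these notes prescribe. S4 is a hypothetical fourth sequence added only because three taxa cannot have an informative site; the function names are also made up.

```python
from collections import Counter


def informative_sites(alignment):
    """Return 0-based indices of parsimony-informative columns."""
    sites = []
    for i, column in enumerate(zip(*alignment)):
        counts = Counter(column)
        # Need at least two characters that each occur at least twice.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(i)
    return sites


def fitch(tree, states):
    """Return (candidate ancestral states, changes) for one column.

    tree: nested 2-tuples of leaf names; states: leaf name -> character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    (lset, lcost), (rset, rcost) = fitch(tree[0], states), fitch(tree[1], states)
    common = lset & rset
    if common:                      # children agree: no extra change needed
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1


def parsimony_score(tree, seqs):
    """Total number of changes the tree requires across all columns."""
    length = len(next(iter(seqs.values())))
    return sum(fitch(tree, {n: s[i] for n, s in seqs.items()})[1]
               for i in range(length))


seqs = {"S1": "aactc", "S2": "aagtc", "S3": "tagtt", "S4": "tactt"}  # S4 is hypothetical
print(informative_sites(list(seqs.values())))                # [0, 2, 4]
print(parsimony_score((("S1", "S2"), ("S3", "S4")), seqs))   # 4
print(parsimony_score((("S1", "S4"), ("S2", "S3")), seqs))   # 5
print(parsimony_score((("S1", "S3"), ("S2", "S4")), seqs))   # 6
```

Of the three possible groupings of four taxa, the one that needs the fewest changes is the most parsimonious tree.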
We aren't going to spend too much time on maximum parsimony here, since the statistics involved are not complex and involve the same sort of substitution models that distance methods do; I do want to point out that computationally, these methods have to be heuristic rather than exhaustive – there are too many possible trees once you have, say, 30 taxa, to look at all possible tree configurations [2], so these algorithms take a variety of shortcuts to find a "best" tree – primarily branch swapping to see if more parsimonious trees (with fewer steps, or changes, required) can be found.
[2] See https://rdrr.io/cran/ape/man/howmanytrees.html for an example including code you can use to calculate it!
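To see how quickly the number of trees grows, here is a short sketch using the standard counting formulas for fully bifurcating, labelled trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. The function names are made up for this example.

```python
def double_factorial(n: int) -> int:
    """n!! = n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def rooted_trees(n_taxa: int) -> int:
    """Number of rooted bifurcating trees for n_taxa >= 2 labelled tips."""
    return double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees for n_taxa >= 3 labelled tips."""
    return double_factorial(2 * n_taxa - 5)


for n in (4, 10, 30):
    print(n, f"{rooted_trees(n):.3e}", f"{unrooted_trees(n):.3e}")
```

Python integers have arbitrary precision, so the 30-taxon counts are computed exactly before being formatted for display.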
Let's move on to a more statistically oriented method – Maximum Likelihood.
MAXIMUM LIKELIHOOD
Maximum Likelihood was, for a long time, considered the "third" method of building trees (after distance and parsimony). As you may have guessed, it's based on the statistical concept of maximum likelihood estimation, or MLE. MLE estimates the parameters of a probability distribution by maximizing a likelihood function so that the observed data are most probable (or likely) under those parameters. A simpler way of saying this is that MLE takes candidate parameters (e.g., a phylogenetic tree structure) and asks how probable the observed data (e.g., sequence data) would be if those parameters were true. This sounds backwards – you start with a tree, then calculate the probability that the tree "fits" the data – but it's actually very little different from the heuristic branch swapping that happens in a parsimony analysis, where the tree is modified to see if it fits better. We can define this as:
P(X | θ)
We can read this as "what is the probability of X given θ"; in this case X always represents the observed data (the sequence alignment) and θ represents the parameters of the model (the tree topology as well as the evolutionary model selected by the user). The goal of the algorithms that perform MLE calculations is to find a value of θ that maximizes P(X | θ) (the probability of X given θ). As with parsimony analysis, the number of possible trees is astronomically high once you exceed a certain number of taxa, which makes these algorithms very compute-time intensive. Similarly, there are heuristic approaches that use a "starting" tree and simply optimize results based on the evolutionary model chosen to find an optimal (but probably not "best") tree. The likelihood of a tree is the product of the likelihoods at each site in the alignment (in practice, the sum of the per-site log-likelihoods), with the assumption that the sites evolve independently (a Markov chain-like model). To derive the likelihood for any given site, the algorithms calculate the probability of every possible reconstruction of ancestral states given the chosen model of substitution and add those probabilities together. Then, a branch-swapping step is performed (similar to the parsimony approach above), but instead of optimizing for the minimum number of changes overall, MLE methods optimize the likelihood calculations.
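A full tree search is too long to show here, but the core idea of maximizing P(X | θ) can be sketched on the smallest possible problem: two sequences, with θ reduced to a single number d, the evolutionary distance between them. This sketch assumes the Jukes-Cantor model and uses a plain grid search in place of the optimizers real software uses; the function names are made up for this example.

```python
import math


def jc69_log_likelihood(d: float, a: str, b: str) -> float:
    """Log of P(X | d) for two aligned sequences under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # site ends in the same base it started with
    p_diff = 0.25 - 0.25 * decay   # site ends in one specific different base
    log_l = 0.0
    for x, y in zip(a, b):
        # 0.25 is the assumed frequency of the starting base; sites are
        # treated as independent, so their log-likelihoods simply add up.
        log_l += math.log(0.25 * (p_same if x == y else p_diff))
    return log_l


def ml_distance(a: str, b: str, step: float = 0.0001, upper: float = 2.0) -> float:
    """Grid search for the distance d that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, a, b))


d_hat = ml_distance("aactc", "aagtc")
closed_form = -0.75 * math.log(1 - 4 * 0.2 / 3)  # Jukes-Cantor formula, p = 1/5
print(round(d_hat, 3), round(closed_form, 3))
```

For this two-sequence case the grid search lands on the same value as the Jukes-Cantor distance formula, which is a handy sanity check that the likelihood is set up correctly.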
Evolution probably doesn't support the Markov chain model fully, since a mutation at one site in a protein-coding gene may cause missense or nonsense mutations – so there are evolutionary constraints involved (individuals with nonsense or missense mutations may be selected against, depending on how detrimental the mutation is). Nonetheless, these methods work sufficiently well.
Let's briefly look at one more method – Bayesian Inference of Phylogeny.
BAYESIAN
As you may have guessed, the Bayesian method of phylogenetic reconstruction is an inferential probabilistic method based on Bayes' theorem. Similar to the MLE method, it asks how well a given tree matches the data (and evolutionary models) provided, but it answers with a posterior probability computed from the Bayes formula rather than with a maximum likelihood estimate. Underlying this is a Markov Chain Monte Carlo algorithm, where the probability distributions describe the uncertainty of the unknowns (e.g., the tree topology and the evolutionary model parameters). Bayes' theorem is used to calculate the posterior distribution of the parameters θ given the data D, much as MLE used the likelihood calculations:
f(θ | D) = f(D | θ) f(θ) / f(D)
The term f(D | θ) in Bayes' theorem is also called the likelihood, but don't let that confuse you: what Bayesian inference reports is the posterior probability f(θ | D), which combines that likelihood with the prior f(θ).
One big (and positive) difference of Bayesian inference in this case is that it makes direct probabilistic statements about the parameters – it gives us a credible interval, or CI, a range that contains the true parameter with a stated posterior probability, something that is impossible with classical statistics [3].
[3] Classical statistics treats parameters as unknown constants and cannot derive them de novo.
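To make the posterior and the credible interval concrete, here is a toy sketch for a single parameter: the Jukes-Cantor distance d between two sequences that match at 4 sites and differ at 1 (like S1 and S2 earlier). It replaces MCMC with a simple grid approximation, which only works because there is one parameter, and the Exponential prior with rate 10 is an arbitrary assumption chosen for illustration. The function names are made up for this example.

```python
import math


def log_likelihood(d: float, n_same: int, n_diff: int) -> float:
    """Jukes-Cantor log-likelihood, up to a constant that cancels out later."""
    decay = math.exp(-4.0 * d / 3.0)
    return (n_same * math.log(0.25 + 0.75 * decay)
            + n_diff * math.log(0.25 - 0.25 * decay))


def grid_posterior(n_same, n_diff, prior_rate=10.0, step=0.001, upper=3.0):
    """Posterior over d on a grid: likelihood x prior, then normalize."""
    grid = [step * k for k in range(1, int(upper / step) + 1)]
    # Exponential(prior_rate) prior on d (an assumption for this sketch).
    weights = [math.exp(log_likelihood(d, n_same, n_diff) - prior_rate * d)
               for d in grid]
    total = sum(weights)            # plays the role of f(D) on the grid
    return grid, [w / total for w in weights]


def credible_interval(grid, post, mass=0.95):
    """Equal-tailed interval holding `mass` of the posterior probability."""
    tail = (1 - mass) / 2
    cum, lo, hi, lo_found = 0.0, grid[0], grid[-1], False
    for d, p in zip(grid, post):
        cum += p
        if not lo_found and cum >= tail:
            lo, lo_found = d, True
        if cum >= 1 - tail:
            hi = d
            break
    return lo, hi


grid, post = grid_posterior(n_same=4, n_diff=1)
lo, hi = credible_interval(grid, post)
print(f"95% credible interval for d: {lo:.3f} to {hi:.3f}")
```

With only five sites the interval is wide, and changing the prior rate shifts it noticeably, which is a useful reminder of how much the prior matters when data are scarce.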
FINAL THOUGHTS
The most common question any professor will ever hear about this topic is "which method of phylogenetic reconstruction should I use?" The answer (as you might have expected) is, "it depends". Do you need to reconstruct the phylogeny for more than 30 or so taxa? Then distance is the only approach that will finish before the heat death of the universe (at least until quantum computing is a real thing [4]). Are you looking at fewer than 30 or so taxa? My advice has always been to do as many methods as you can and compare the trees – identify the common branches/nodes and draw what conclusions you can. The software called MrBayes (which is – you guessed it – a Bayesian method) has become tremendously popular in the past decade, but PAUP (a maximum parsimony method) and PHYLIP (various approaches, but best at distance) are still very heavily used.
[4] And yes, I'm familiar with the D-Wave adiabatic computer. It's not quite ready for prime time yet, at least not for bioinformatics.
That's it for this week – be sure to check in to the discussion forums and post answers to the questions posed!