Friday, November 17, 2017

Calculating time of divergence using genome sequences and mutation rates (humans vs other apes)

There are several ways to report a mutation rate. You can state it as the number of mutations per base pair per year, in which case a typical mutation rate for humans is about 5 × 10⁻¹⁰. Or you can express it as the number of mutations per base pair per generation (~1.5 × 10⁻⁸).

You can use the number of mutations per generation or per year if you are only discussing one species. In humans, for example, you can describe the mutation rate as 100 mutations per generation and just assume that everyone knows the number of base pairs (6.4 × 10⁹).

The intrinsic mutation rate depends on the error rate of DNA replication. We don't know the exact value of this error rate but it's pretty close to 10⁻¹⁰ per base pair when you take repair into account [Estimating the Human Mutation Rate: Biochemical Method]. For single-cell species you simply multiply this number by the number of base pairs in the genome to get a good estimate of the mutation rate per generation.

The calculation for multicellular species is much more complicated because you have to know the number of cell divisions between zygote and mature germ cells. In some cases it's impossible to know this number (e.g. flowering plants, yeast). In other cases we have a pretty good estimate: for example, in humans there are about 400 cell divisions between zygote and mature sperm and about 30 cell divisions between zygote and mature egg cells. The number of cell divisions depends on the age of the parent, especially in males [Parental age and the human mutation rate]. This effect is significant: older parents pass on twice as many mutations as the youngest parents.
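As a rough sketch of how these numbers combine in a biochemical-style estimate, the snippet below multiplies the replication error rate by the genome size and by the number of cell divisions in each parent's germline. The haploid genome size of 3.2 × 10⁹ bp (half the diploid figure used in this post) and the function name are assumptions made for illustration, and the parental age effect is ignored.

```python
# Rough biochemical-style estimate of new mutations per generation.
ERROR_RATE_PER_BP = 1e-10   # replication error rate after repair (from the text)
HAPLOID_GENOME_BP = 3.2e9   # ASSUMPTION: half of the 6.4e9 bp diploid genome
DIVISIONS_SPERM = 400       # zygote -> mature sperm (from the text)
DIVISIONS_EGG = 30          # zygote -> mature egg (from the text)


def germline_mutations(divisions: int) -> float:
    """Expected new mutations in one gamete after `divisions` rounds of replication."""
    return ERROR_RATE_PER_BP * HAPLOID_GENOME_BP * divisions


paternal = germline_mutations(DIVISIONS_SPERM)
maternal = germline_mutations(DIVISIONS_EGG)
total = paternal + maternal

print(f"paternal ~{paternal:.0f}, maternal ~{maternal:.0f}, total ~{total:.0f}")
```

With these inputs the paternal contribution is about 128 mutations and the maternal contribution about 10, so most new mutations come from the father.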

The parental age effect is comparable to the extremes in estimations of the human mutation rate based on different ways of measuring it [Human mutation rates - what's the right number?] [Human mutation rates]. Those values range from about 60 mutations per generation to about 160 mutations per generation.

Thus, in the case of humans, we're dealing with estimates that differ by a factor of two depending on method and parental age.



Let's assume that each child is born with 100 new mutations. This seems like a reasonable number. It's on the high end of direct counts by sequencing parents and siblings but there are reasons to believe these counts are underestimated (Scally, 2016). On the other hand, this value (100 mutations) is on the low end of the estimates using the biochemical method and the phylogenetic method.

Most of these mutations occur in the father but some were contributed by the mother. Since the child is diploid, we calculate the mutation rate per bp as: 100 ÷ (6.4 × 10⁹) = 1.56 × 10⁻⁸ per base pair per generation. Assuming an average generation time of 30 years, this gives 1.56 × 10⁻⁸ ÷ 30 = 5.2 × 10⁻¹⁰ mutations per bp per year. That's the value given above (rounded to 5 × 10⁻¹⁰). Scally (2016) uses this same value except he assumes a generation time of 29 years.
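The conversion in the paragraph above can be sketched in a few lines. The function names are hypothetical; the inputs are the values used in this post.

```python
# Convert a per-genome mutation count into per-bp rates.
MUTATIONS_PER_GENERATION = 100   # new mutations per child (assumed in the text)
DIPLOID_GENOME_BP = 6.4e9        # diploid genome size used in the text


def per_bp_per_generation(mutations: float, genome_bp: float) -> float:
    """Mutations per base pair per generation."""
    return mutations / genome_bp


def per_bp_per_year(mutations: float, genome_bp: float, gen_years: float) -> float:
    """Mutations per base pair per year for a given generation time."""
    return per_bp_per_generation(mutations, genome_bp) / gen_years


rate_gen = per_bp_per_generation(MUTATIONS_PER_GENERATION, DIPLOID_GENOME_BP)
rate_year_30 = per_bp_per_year(MUTATIONS_PER_GENERATION, DIPLOID_GENOME_BP, 30)
rate_year_29 = per_bp_per_year(MUTATIONS_PER_GENERATION, DIPLOID_GENOME_BP, 29)

print(f"{rate_gen:.2e} per bp per generation")           # 1.56e-08
print(f"{rate_year_30:.2e} per bp per year (30-yr gen)")  # 5.21e-10
print(f"{rate_year_29:.2e} per bp per year (29-yr gen)")  # 5.39e-10
```

The 29-year and 30-year generation times give nearly the same yearly rate, which is why both round to the same value.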

There are many who think this value is considerably lower than previous estimates and that this casts doubt on the traditional times of divergence of chimps and humans and the other great apes. For example, Scally (2016) says that prior to the availability of direct sequencing data the "consensus value" was 10 × 10⁻¹⁰ per bp per year.¹ That's twice the value he prefers today. It works out to 186 mutations per generation!
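As a quick check of that figure, here is a minimal sketch that converts a yearly per-bp rate back into mutations per generation. It assumes Scally's 29-year generation time and the diploid genome size used in this post; the function name is hypothetical.

```python
DIPLOID_GENOME_BP = 6.4e9  # diploid genome size used in the text


def mutations_per_generation(rate_per_bp_per_year: float,
                             gen_years: float,
                             genome_bp: float = DIPLOID_GENOME_BP) -> float:
    """New mutations per diploid genome per generation."""
    return rate_per_bp_per_year * genome_bp * gen_years


older_consensus = mutations_per_generation(10e-10, 29)
current_value = mutations_per_generation(5e-10, 29)

print(f"older consensus rate: ~{older_consensus:.0f} mutations per generation")
print(f"current rate:         ~{current_value:.0f} mutations per generation")
```

The older rate comes out at about 186 mutations per generation and the current rate at about 93, which is the factor of two described above.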

I think it's been a long time since workers in the field assumed such a high mutation rate but let's assume he is correct and current estimates are considerably lower than those from twenty years ago.

You can calculate a time of divergence (t) between any two species if you know the genetic distance (d) between them measured in base pairs and the mutation rate (μ) in mutations per year.² The genetic distance can be estimated by comparing genome sequences and counting the differences. It represents the number of mutations that have become fixed in the two lineages since they shared a common ancestor. Haploid reference genome sequences are sufficient for this estimate.

The mutation rate (μ) is 100 mutations per generation divided by 30 years = 3.3 mutations per year.

The time of divergence is then calculated by dividing half that distance (in nucleotides) by the mutation rate (t = d/2 ÷ μ). (There are all kinds of "corrections" that can be applied to these values but let's ignore them for now and see what the crude data says.)

Human and chimp genomes differ by about 1.4%, which corresponds to 44.8 million nucleotide differences in a haploid genome of 3.2 × 10⁹ bp, so d/2 = 22.4 million. Using 100 mutations per generation as the mutation rate means 5 × 10⁻¹⁰ per bp per year. From t = d/2 ÷ μ, with μ = 3.3 mutations per year, we get t = 6.8 million years.

This is a reasonable number. It's consistent with the known fossil record and it's in line with the current views of a divergence time for chimps and humans.

However, there are reasons to believe that some of the assumptions in this calculation are wrong. For example, the average generation time is probably not 30 years in both lineages over the last few million years. It's probably shorter, at least in the chimp lineage where the current generation time is 25 years. Using a generation time of 25 years gives a divergence time of 5.6 million years.
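The crude calculation used so far, t = d/2 ÷ μ, can be sketched as follows. The haploid genome size of 3.2 × 10⁹ bp is an assumption inferred from the 44.8 million figure, and the function name is hypothetical. This reproduces the post's arithmetic as stated, with μ expressed as 100 mutations per generation divided by the generation time.

```python
HAPLOID_GENOME_BP = 3.2e9  # ASSUMPTION: 1.4% of this is the 44.8 million in the text


def divergence_time_years(fraction_diff: float,
                          mutations_per_generation: float,
                          gen_years: float,
                          genome_bp: float = HAPLOID_GENOME_BP) -> float:
    """Crude divergence time t = (d / 2) / mu, with no corrections applied."""
    d = fraction_diff * genome_bp                 # nucleotide differences
    mu = mutations_per_generation / gen_years     # mutations per year
    return (d / 2) / mu


t30 = divergence_time_years(0.014, 100, 30)
t25 = divergence_time_years(0.014, 100, 25)

print(f"30-year generations: {t30 / 1e6:.1f} million years")
print(f"25-year generations: {t25 / 1e6:.1f} million years")
```

Without rounding μ to 3.3, the 30-year case comes out at about 6.7 rather than 6.8 million years. The 25-year case gives 5.6 million years.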

In addition, the overall differences between the human and chimp genomes may be only 1.2% instead of 1.4% (see Moorjani et al., 2016). If you combine this value with the shorter generation time, you get about 4.8 million years for the time of divergence (19.2 million ÷ 4 mutations per year).

Given the imprecision of the mutation rate, the question of real generation time, and problems in estimating the overall difference between humans and chimps, we can't know for certain what time of divergence is predicted by a molecular clock. On the other hand, the range of values (e.g. 4.8 - 6.8 million years) isn't cause for great concern.

So, what's the problem? The problem is that applying the human mutation rate (100 mutations per generation) to more distantly related species gives strange results. For example, Scally (2016) uses this mutation rate and a difference of 2.6% to estimate the time of divergence of humans and orangutans. The calculation yields a value of 26 million years. This is far too old according to the fossil record.
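The 26 million year figure can be reproduced with the per-base-pair form of the calculation, where both the distance and the rate are expressed per bp so the genome size drops out. This is a sketch of the arithmetic only, assuming a rate of 5 × 10⁻¹⁰ per bp per year; the function name is hypothetical.

```python
def divergence_time_years(fraction_diff: float, rate_per_bp_per_year: float) -> float:
    """Divergence time from a per-bp distance and a per-bp yearly rate: t = (d / 2) / rate."""
    return 0.5 * fraction_diff / rate_per_bp_per_year


t_orangutan = divergence_time_years(0.026, 5e-10)
print(f"human-orangutan: {t_orangutan / 1e6:.0f} million years")
```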

Several recent papers have addressed this issue (Scally, 2016; Moorjani et al., 2016a; Moorjani et al., 2016b). Most of the problem is solved by assuming a much higher mutation rate in the past. The biggest effect is the generation time in years. It may have been as low as 15 years for much of the past ten million years. Many of the problems go away when you adjust for this effect.

What puzzles me is the approach taken by Moorjani et al. in their two recent papers. They say that the "new" mutation rate is 5 × 10⁻¹⁰ per bp per year. That's exactly the value I use above. It's roughly 100 new mutations per child (per generation). Moorjani et al. (2016a) think this value is surprisingly low because it leads to a surprising result. They explain it in a section titled "The Puzzle."

They assume that the human and chimp genomes differ by 1.2%. That works out to 38 million mutations over the entire genome. This is 19 million fixed mutated alleles in each lineage if the mutation rate in both lineages is equal and constant.

If the mutation rate is 5 × 10⁻¹⁰ per bp per year then for a haploid genome this is 1.6 mutations per year. Dividing 19 million by 1.6 gives 11.9 million years (rounded to 12 million) for the time of divergence. This is the value quoted by the authors.

"Taken at face value, this mutation rate suggests that African and non-African populations split over 100,000 years and a human-chimpanzee divergence time of 12 million years ago (Mya) (for a human–chimpanzee average nucleotide divergence of 1.2% at putatively neutral sites). These estimates are older than previously believed, but not necessarily at odds with the existing—and very limited—paleontological evidence for Homininae. More clearly problematic are the divergence times that are obtained for humans and orangutans or humans and OWMs [Old World Monkeys]. As an illustration, using whole genome divergence estimates for putatively neutral sites suggests a human–orangutan divergence time of 31 Mya and human–OWM divergence time of 62 Mya. These estimates are implausibly old, implying a human-orangutan divergence well into the Oligocene and OWM-hominoid divergence well into or beyond the Eocene. Thus, the yearly mutation rates obtained from pedigrees seem to suggest dates that are too ancient to be readily reconciled with the current understanding of the fossil record."

Here's the problem. If the mutation rate is 100 mutations per generation then this applies to DIPLOID genomes. Some of the mutations are contributed by the mother and some (more) by the father. If you apply this rate to a DIPLOID genome then the number of mutations per year is 3.3 (100/30 years). Or,

         5 × 10⁻¹⁰ per bp per year × 6.4 × 10⁹ bp (diploid) = 3.2 mutations per year

Dividing 19 million mutations by 3.2 gives a time of divergence of 5.9 million years. This is a reasonable number but it's half the value calculated by Moorjani et al. (2016a).
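The two competing figures differ only in which genome size is used to turn the per-bp rate into mutations per year. The sketch below lays the two calculations side by side without deciding between them. The haploid size of 3.2 × 10⁹ bp is an assumption (half the diploid figure used here), and the variable names are hypothetical.

```python
RATE_PER_BP_PER_YEAR = 5e-10
HAPLOID_GENOME_BP = 3.2e9   # ASSUMPTION: half of the diploid size
DIPLOID_GENOME_BP = 6.4e9   # diploid size used in the text
FRACTION_DIFF = 0.012       # human-chimp difference assumed by Moorjani et al.

# Fixed differences per lineage, counted on the haploid reference genomes.
fixed_per_lineage = 0.5 * FRACTION_DIFF * HAPLOID_GENOME_BP

# Mutations per year under each genome-size convention.
mu_haploid = RATE_PER_BP_PER_YEAR * HAPLOID_GENOME_BP
mu_diploid = RATE_PER_BP_PER_YEAR * DIPLOID_GENOME_BP

t_haploid = fixed_per_lineage / mu_haploid
t_diploid = fixed_per_lineage / mu_diploid

print(f"fixed differences per lineage: {fixed_per_lineage / 1e6:.1f} million")
print(f"haploid rate {mu_haploid:.1f}/yr -> {t_haploid / 1e6:.1f} million years")
print(f"diploid rate {mu_diploid:.1f}/yr -> {t_diploid / 1e6:.1f} million years")
```

Using the unrounded 19.2 million differences gives 12.0 and 6.0 million years. The 11.9 and 5.9 in the text come from rounding to 19 million first.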

They also calculate a value of 12.1 million years for the human-chimp divergence in their second paper (and 15.1 million years for the divergence of humans and gorillas) (Moorjani et al., 2016b).

I think their calculations are wrong because they used the haploid genome size rather than the diploid genome where the mutations are accumulating. Both these papers appear in good journals and both were peer-reviewed. Furthermore, the senior author, Molly Przeworski, is a Professor at Columbia University (New York, NY, USA) and she's an expert in this field.

What am I doing wrong? Is it true that a mutation rate of ~100 mutations per generation means that humans and chimpanzees must have been separated for 12 million years as Moorjani et al. say? Or is the real value 5.9 million years as I've calculated above?

Image Credit: The chromosome image is from Wikipedia: Creative Commons Attribution 2.0 Generic license. The chimp photo is also from Wikipedia.

1. Scally takes this value from Nachman and Crowell (2000) who claim that the mutation rate is ~2.5 × 10⁻⁸ mutations per bp per generation in humans. This works out to 160 mutations per generation and an overall mutation rate of 8 × 10⁻¹⁰ per bp per year based on a generation time of 30 years, not 10 × 10⁻¹⁰ as Scally states.

2. This assumes that all mutations are neutral. The rate of fixation of neutral alleles over time is equal to the mutation rate. Since 8% of the genome is under selection, it's not true that all mutations are neutral but to a first approximation it's not far off.
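
The claim in note 2 that the neutral fixation rate equals the mutation rate can be sketched as follows, under the standard neutral-theory assumptions. The symbols N (number of diploid individuals) and k (substitution rate per generation) are introduced here for illustration and are not used elsewhere in this post; μ is the neutral mutation rate per generation.

```latex
% New neutral mutations entering the population each generation: 2N\mu
% Probability that any one new neutral mutation eventually becomes fixed: 1/(2N)
k = 2N\mu \times \frac{1}{2N} = \mu
```

The population size cancels, which is why the crude calculation can use the mutation rate directly.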

Moorjani, P., Gao, Z., and Przeworski, M. (2016a) Human germline mutation and the erratic evolutionary clock. PLoS Biology, 14:e2000744. [doi: 10.1371/journal.pbio.2000744]

Moorjani, P., Amorim, C.E.G., Arndt, P.F., and Przeworski, M. (2016b) Variation in the molecular clock of primates. Proc. Nat. Acad. Sci. (USA) 113:10607-10612. [doi: 10.1073/pnas.1600374113]

Nachman, M.W., and Crowell, S.L. (2000) Estimate of the mutation rate per nucleotide in humans. Genetics, 156:297-304. [PDF]

Scally, A. (2016) The mutation rate in human evolution and demographic inference. Current opinion in genetics & development, 41:36-43. [doi: 10.1016/j.gde.2016.07.008]


  1. If you assume a mutation rate of 5e-10 mutations per base pair per year, and a human-chimp distance of 1.2% or 0.012, the divergence time is about 0.5 * 0.012 / 5e-10 = 12My. That is the puzzle, and it remains whatever convention you use (haploid vs diploid) for stating a number of mutations per generation.

  2. The first term in your equation (0.5 * 0.012) is half the percentage of the HAPLOID reference genomes that differ between humans and chimps.

    The second term in your equation is the mutation rate expressed as the number of mutations per DIPLOID genome per year. We know that the mutation rate is about 100 mutations per generation in a DIPLOID genome. The mutation rate in your equation is based on the number of mutations per year per 6.4e9 base pairs.

    5e-10 X 6.4e9 X 30 years = 96 mutations/generation

    1. We agree that the first term in your equation is 19 million ...

      0.5 X 0.012 X haploid genome size = 19 million

      The second term in your equation is either 1.6 mutations per year or 3.2 mutations per year depending on who does the calculation.

      The data says about 100 mutations per generation so the second value is the correct one.

    2. I think here is another interesting case: according to evolution, fly and mosquito split off about 250 My ago. A fly generation is less than one month. So even if one generation means only 1 new mutation (I think in reality they get many more) we will need only 10^8 months to change their entire genome, or less than 10^7 years. So fly and mosquito should be different across their entire genomes (or at least 75% different). That's far from reality as far as I'm aware.


    4. Are you saying that the entire genome is conserved?

    5. You're not accounting for saturation effects, either. If there's a 10% difference between genomes, then there's a 0.1 chance of a mutation occurring where the genomes are already divergent. I'm also not sure where you get the 250 Ma figure from. Misof et al. 2014 has the split at ~160Ma and the refined value we got (preprint) is at ~120Ma. That figure is half of the divergence time you assume, which is closer to the Diptera - (Mecoptera + Siphonaptera) split in both Misof et al. and the paper in prep.

    6. And then there's the fact that mutations don't happen uniformly across the genome either. The chance of mutation is not equiprobable across all nucleotides.

    7. "2014 has the split at ~160Ma and the refined value we got"-

      But it doesn't matter at all. Even 100 My is too much. I just assume about 100 new mutations per generation, and this is the number you will get. So something is very weird here.

      "And then there's the fact that mutations don't happen uniformly across the genome either."-

      But we are talking about the entire genome here. Are you saying that the majority of the genome doesn't get any mutations?

    8. All the genome gets mutations. However, mutations happen more often at certain "hotspots" than elsewhere. (Why? I don't know.)

      Only part of the genome is conserved. DNA sequences for genes, regulatory regions, telomeres, and other useful functions are conserved. Mutations happen there, but most of the mutations are selected out because they do harm. In the "junk" regions of the DNA (about 90% of the DNA in humans), sequence is not conserved.

    9. What regulates the "hot spots"?
      What regulates mutations in overlapping genes?

  3. The amount of difference between human and chimp DNA should be the product of the mutation rate per year with the sum of the divergence time plus the coalescence time of two gene copies, times 2. One must be careful not to leave out the coalescence time, which is a number of generations equal to twice the effective population size.

  4. This is from the PNAS paper by Moorjani et al. I've lost track of what the puzzle is.

    "we estimate that humans diverged from chim-
    panzees ~12.1 Mya and from gorillas ~15.1 Mya (Fig. 4). Assuming
    further that the effective population size of the human–ape an-
    cestor was five times the current population size (as estimated by
    refs. 43, 44), the human–chimpanzee split time is ~7.9 Mya, and
    the human–gorilla split time is 10.8 Mya. We note that there is
    substantial uncertainty in estimates of ancestral population size of
    apes, with previous estimates ranging between 50,000 and 100,000
    (43–45). Accounting for this uncertainty provides estimates of
    human–chimpanzee split time in the range of 6.5–9.3 Mya, and
    human–gorilla split time in the range of 9.4–12.2 Mya."

  5. It sounds as if they have taken coalescence effects properly into account.

  6. Fossil record should not be a factor in figuring genetic operations! This must stop, this studying rocks and using it to confirm biology hypotheses.
    This divergence concept is entirely based on observing modern mutation rates which do not produce much mutation in people.
    this idea allows for no other options.
    If a creator created biology in finished form and then mutations were going on in trivial ways IT WOULD be a math error, or a concept error, to extrapolate back from this mutation rate.
    mutation rates do not demonstrate man/ape divergence but only show a divergence score, AFTER A PRESUMPTION, and so they show nothing of genetic evidence for common descent.
    its still all reinforcing presumptions without scientific methodology.
    you can't falsify any of these rates conclusions unless already agreeing that these mutations have been going on.

    1. Robert, you ever hear the phrase from carpentry "Measure twice, cut once"? The biology and the rocks are measuring twice, that's all.

    2. But you presume the existence of a creator god, despite the utter nonexistence of any testable evidence to that effect.

      Mote, beam, Robert Byers.

  7. Going back to the beginning of the post, the difference between the mutation rates calculated by biochemical measurements and the direct counts from parent/sibling sequences seems to me likely to be a product of differential mortality. Parents and siblings are survivors, no, and therefore a biased sample. (Yes, direct counts from non-survivors are impossible.) For the purposes of estimating time of evolutionary divergence, though, it appears that the direct counts are the best estimate.

    The thing is, if there is a period in the evolutionary history when the constraints of natural selection are slackened, so that variations in phenotype that previously suffered differential mortality instead prosper, the mutation rate as measured in direct counts would be higher. And it seems to me that a population moving into a new area could indeed tolerate a higher effective mutation rate (as I suppose you could call it.) Obviously this is just a version of radiative adaptation.

    Doesn't it seem as though these divergences in estimates are really suggesting a period of radiative adaptation in those lineages?

    1. There are plenty of ways to modify the mutation rate in order to account for various effects that the crude values ignore. There are several ways of adjusting the formula for calculating times of divergence. One of them involves taking coalescence and population size into account as Joe pointed out above.

      These are the main points of the papers I referenced but it's not the main point of my post. I'm asking whether Moorjani et al. have done the arithmetic correctly.

  8. Why do you have negative signs in front of your exponents? e.g.

    "In humans, for example, you can describe the mutation rate as 100 mutations per generation and just assume that everyone knows the number of base pairs (6.4 × 10-9). "

    I'm pretty sure you mean 10 to the 9th power or billions here, right?

    Also this page says, "The human genome contains approximately 3 billion of these base pairs..."

    Your quote above says the number of base pairs is 6.4 billion. What are you referring to?

    1. Thank-you for spotting the typo. I fixed it.

      The current version of the human reference genome (GRCh38.p11) contains 3,092,480,053 base pairs. There are still quite a few gaps but the approximate size of most gaps is known. The total genome size is estimated to be 3,253,848,404 bp.

      Diploid cells (e.g. human zygote) contain twice this amount of DNA.

  9. Larry Moran-
    From the paper it appears Moorjani et al. does use the haploid rather than diploid genome size.
    Your arithmetic is correct.
    It seems you have solved ‘the puzzle’.