Sunday, July 02, 2017

Genome size confusion

The July 2017 issue of Nature Reviews Genetics contains an interesting review of a topic that greatly interests me.
Breschi, A., Gingeras, T. R., and Guigó, R. (2017). Comparative transcriptomics in human and mouse. Nature Reviews Genetics [doi: 10.1038/nrg.2017.19]

Cross-species comparisons of genomes, transcriptomes and gene regulation are now feasible at unprecedented resolution and throughput, enabling the comparison of human and mouse biology at the molecular level. Insights have been gained into the degree of conservation between human and mouse at the level of not only gene expression but also epigenetics and inter-individual variation. However, a number of limitations exist, including incomplete transcriptome characterization and difficulties in identifying orthologous phenotypes and cell types, which are beginning to be addressed by emerging technologies. Ultimately, these comparisons will help to identify the conditions under which the mouse is a suitable model of human physiology and disease, and optimize the use of animal models.
I was confused by the comments made by the authors when they started comparing the human and mouse genomes. They said,
The most recent genome assemblies (GRC38) include 3.1 Gb and 2.7 Gb for human and mouse respectively, with the mouse genome being 12% smaller than the human one.
I think this statement is misleading. The size of the human genome isn't known with precision but the best estimate is 3.2 Gb [How Big Is the Human Genome?]. The current "golden path length" according to Ensembl is 3,096,649,726 bp. [Human assembly and gene annotation]. It's not at all clear what this means and I've found it almost impossible to find out; however, I think it approximates the total amount of sequenced DNA in the latest assembly plus an estimate of the size of some of the gaps.

The golden path length for the mouse genome is 2,730,871,774 bp. [Mouse assembly and gene annotation]. As is the case with the human genome, this is NOT the genome size. Not as much mouse DNA sequence has been assembled into a contiguous and accurate assembly as is the case with humans. The total mouse sequence is at about the same stage the human genome assembly was a few years ago.

If you look at the mouse genome assembly data you see that 2,807,715,301 bp have been sequenced and there are 79,356,856 bp in gaps. That's 2.89 Gb, which doesn't match the golden path length and doesn't match the past estimates of the mouse genome size.
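The mismatch is easy to check by hand, but here is a minimal Python sketch of the arithmetic. It uses only the three mouse figures quoted above; the variable names and the `to_gb` helper are my own, and the script says nothing about why the numbers differ.

```python
# Mouse assembly figures as quoted in the post (base pairs).
MOUSE_SEQUENCED_BP = 2_807_715_301
MOUSE_GAP_BP = 79_356_856
MOUSE_GOLDEN_PATH_BP = 2_730_871_774


def to_gb(bp: int) -> float:
    """Convert base pairs to gigabases (1 Gb = 1e9 bp)."""
    return bp / 1e9


# Sequenced bases plus gap bases.
total_bp = MOUSE_SEQUENCED_BP + MOUSE_GAP_BP

# How far that total exceeds the reported golden path length.
excess_bp = total_bp - MOUSE_GOLDEN_PATH_BP

print(f"sequenced + gaps:   {total_bp:,} bp ({to_gb(total_bp):.3f} Gb)")
print(f"golden path length: {MOUSE_GOLDEN_PATH_BP:,} bp ({to_gb(MOUSE_GOLDEN_PATH_BP):.3f} Gb)")
print(f"difference:         {excess_bp:,} bp ({to_gb(excess_bp):.3f} Gb)")
```

The sum is 2,887,072,157 bp, which is about 156 Mb more than the golden path length.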

We don't know the exact size of the mouse genome. It's likely to be similar to that of the human genome but it could be a bit larger or a bit smaller. The point is that it's confusing to say that the mouse genome is 12% smaller than the human one. What the authors could have said is that less of the mouse genome has been sequenced and assembled into accurate contigs.

If you go to the NCBI site for Homo sapiens you'll see that the size of the genome is 3.24 Gb. The comparable size for Mus musculus is 2.81 Gb. That's about 13% smaller than the human genome size. How accurate is that?
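For anyone who wants to redo the percentages, here is a small Python sketch using the golden path lengths and the NCBI sizes quoted above. The function names are mine. Note that the direction of the comparison matters: "mouse is X% smaller than human" divides by the human size, while "human is Y% larger than mouse" divides by the mouse size, and the two give different numbers.

```python
def percent_smaller(smaller: float, larger: float) -> float:
    """How much smaller `smaller` is, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


def percent_larger(larger: float, smaller: float) -> float:
    """How much larger `larger` is, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100


# Ensembl golden path lengths quoted in the post (bp).
HUMAN_GOLDEN_PATH_BP = 3_096_649_726
MOUSE_GOLDEN_PATH_BP = 2_730_871_774

# NCBI genome sizes quoted in the post (Gb).
HUMAN_NCBI_GB = 3.24
MOUSE_NCBI_GB = 2.81

golden_smaller = percent_smaller(MOUSE_GOLDEN_PATH_BP, HUMAN_GOLDEN_PATH_BP)
ncbi_smaller = percent_smaller(MOUSE_NCBI_GB, HUMAN_NCBI_GB)
ncbi_larger = percent_larger(HUMAN_NCBI_GB, MOUSE_NCBI_GB)

print(f"golden path: mouse is {golden_smaller:.1f}% smaller than human")
print(f"NCBI:        mouse is {ncbi_smaller:.1f}% smaller than human")
print(f"NCBI:        human is {ncbi_larger:.1f}% larger than mouse")
```

With these inputs the golden path lengths differ by about 11.8%, which matches the 12% in the review. The NCBI sizes put the mouse genome about 13.3% below the human one, or equivalently the human genome about 15.3% above the mouse one.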

There's a problem here. Even with all this sequence information, and all kinds of other data, it's still impossible to get an accurate scientific estimate of the total genome sizes.

[Image Credit: Wikipedia: Creative Commons Attribution 2.0 Generic license]


  1. A few things to note:

    There isn't a single genome size for any species, because the genome is of different size for every individual due to indels and CNVs (and those can be on the order of quite a few megabases).

    For the same reasons, it probably doesn't make much sense to be talking about the genome size of an individual, because somatic mutations happen with each division, and then there are of course telomeres, whose length everyone knows changes constantly.

    So trying to pin it down to single-bp precision is futile.

    But the C-value estimates for Mus musculus are indeed lower than those for Homo sapiens.

    1. then there are of course telomeres, whose length everyone knows changes constantly.

      You mean telomeres get shorter with each cell division, right?

    2. Well, if they got shorter with each and every division, there would be no linear chromosomes, right?

    3. "Yet, each time a cell divides, the telomeres get shorter. When they get too short, the cell can no longer divide; it becomes inactive or "senescent" or it dies. This shortening process is associated with aging, cancer, and a higher risk of death. So telomeres also have been compared with a bomb fuse."

      Are Telomeres the Key to Aging and Cancer

    4. You tell me :-)

      Are you suggesting the genetics center for education is wrong?


    6. @Georgi

      I checked out Ryan's database before posting. The C-value for mouse ranges from 2.45 to 4.03.

      There's only a single entry for humans. The C-value is 3.50. There's quite a bit of evidence suggesting that this is too high.

      The variation within humans is on the order of 1% at most. Thus the genome size could be 3.20 or 3.23 or 3.17.

    7. The 4.03 is only one entry, the bulk of measurements are lower than the 3.5 for human (which is admittedly probably too high -- 0.978*3.5 = 3.4Gb)

  2. On the one hand, telomere length does decrease with each cell division -- usually.

    However, the enzyme telomerase can lengthen telomeres. In general, telomerase is most active in cells producing eggs and sperm, with the result that the zygote will have the "normal" set of telomeres. Telomerase is also active in stem cells and in many cancers.

    There is species-to-species variation in the shortening of telomeres and the action of telomerase, with some very long-lived birds not seeming to experience telomere shortening, at least in any consistent way.

    So, "telomeres get shorter with each cell division" is a good first thing to learn. But there's more. Reality is complicated.

  3. Great post. I agree.

    The long tandem repetitive regions are just nasty to estimate. A contig assembly can't help in such cases and the best we can do is an educated guess at the length of these sections.

    I just see N of unspecified length in many genome browsers.

    The chimp genome size based on C-value weight is an average of 3.68 giga base pairs, ranging from 3.46 to 3.85 giga base pairs.

    The latest chimp assembly covers only 3.1 giga base pairs.

    Thus this appears to imply the assembly leaves out between 10% and 19% of the chimp genome.
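
The chimp estimate in the last comment can be reproduced with a few lines of Python. This is only a sketch of that arithmetic: it takes the C-value-based sizes and the 3.1 Gb assembly size from the comment at face value, and the function name `missing_percent` is hypothetical.

```python
def missing_percent(assembly_gb: float, genome_gb: float) -> float:
    """Percentage of the estimated genome not covered by the assembly."""
    return (1 - assembly_gb / genome_gb) * 100


# Figures quoted in the comment (giga base pairs).
CHIMP_ASSEMBLY_GB = 3.1
CHIMP_CVALUE_ESTIMATES_GB = {"low": 3.46, "mean": 3.68, "high": 3.85}

# Missing fraction for each C-value-based genome size estimate.
missing = {
    label: missing_percent(CHIMP_ASSEMBLY_GB, size)
    for label, size in CHIMP_CVALUE_ESTIMATES_GB.items()
}

for label, pct in missing.items():
    print(f"{label:>4} estimate: assembly leaves out about {pct:.1f}%")
```

With those inputs the missing fraction runs from about 10.4% to about 19.5%, with about 15.8% at the mean estimate. The result is only as good as the C-value measurements behind it.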