Friday, August 25, 2017

How much of the human genome is devoted to regulation?

All available evidence suggests that about 90% of our genome is junk DNA. Many scientists are reluctant to accept this evidence—some of them are even unaware of the evidence [Five Things You Should Know if You Want to Participate in the Junk DNA Debate]. Many opponents of junk DNA suffer from what I call The Deflated Ego Problem. They are reluctant to concede that humans have about the same number of genes as all other mammals and only a few more than insects.

One of the common rationalizations is to speculate that while humans may have "only" 25,000 genes, they are regulated and controlled in a much more sophisticated manner than the genes in other species. It's this extra level of control that makes humans special. Such speculations have been around for almost fifty years but they have gained in popularity since publication of the human genome sequence.

In some cases, the extra level of regulation is thought to be due to abundant regulatory RNAs. This means there must be tens of thousands of extra genes expressing these regulatory RNAs. John Mattick is the most vocal proponent of this idea and he won an award from the Human Genome Organization for "proving" that his speculation is correct! [John Mattick Wins Chen Award for Distinguished Academic Achievement in Human Genetic and Genomic Research]. Knowledgeable scientists know that Mattick is probably wrong. They believe that most of those transcripts are junk RNAs produced by accidental transcription at very low levels from non-conserved sequences.

I agree with those scientists but for the sake of completeness here's what John Mattick believes about regulation.
Discoveries over the past decade portend a paradigm shift in molecular biology. Evidence suggests that RNA is not only functional as a messenger between DNA and protein but also involved in the regulation of genome organization and gene expression, which is increasingly elaborate in complex organisms. Regulatory RNA seems to operate at many levels; in particular, it plays an important part in the epigenetic processes that control differentiation and development. These discoveries suggest a central role for RNA in human evolution and ontogeny. Here, we review the emergence of the previously unsuspected world of regulatory RNA from a historical perspective.

... The emerging evidence suggests that there are more genes encoding regulatory RNAs than those encoding proteins in the human genome, and that the amount and type of gene regulation in complex organisms have been substantially misunderstood for most of the past 50 years. (Morris and Mattick, 2014)
The evidence does not support the claim that there are more than 20,000 genes for regulatory RNAs. It's more consistent with the idea that most transcripts are non-functional.

There's another speculation related to regulation. This one was promoted by ENCODE in their original 2007 preliminary study and later on in the now-famous 2012 papers. The ENCODE researchers identified thousands of putative regulatory sites in the genome and concluded ...
... even using the most conservative estimates, the fraction of bases likely to be involved in direct gene regulation, even though incomplete, is significantly higher than that ascribed to protein-coding exons (1.2%), raising the possibility that more information in the human genome may be important for gene regulation than for biochemical function.
They go on to speculate that 8.5% of the genome may be involved in regulation. Think about that for a minute. If we assume that each site covers 100 bp, then the ENCODE researchers are speculating that there might be more than 2 million regulatory sites in the human genome! That's about 100 regulatory sites for every gene!
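Here's a back-of-the-envelope check of those numbers. It's only a sketch: the 8.5% figure, the 100 bp site size, and the 25,000 gene count come from the text above, while the genome size of roughly 3.2 billion bp is an assumption added for the calculation.

```python
# Rough check of the "2 million sites, ~100 per gene" estimate.
GENOME_BP = 3_200_000_000    # assumption: haploid human genome of ~3.2 Gb
REGULATORY_FRACTION = 0.085  # ENCODE's speculated regulatory fraction (8.5%)
SITE_BP = 100                # assumed size of one regulatory site
GENE_COUNT = 25_000          # gene count used earlier in this post

regulatory_bp = GENOME_BP * REGULATORY_FRACTION  # bases devoted to regulation
site_count = regulatory_bp / SITE_BP             # implied number of sites
sites_per_gene = site_count / GENE_COUNT         # implied sites per gene

print(f"Implied regulatory sites: {site_count:,.0f}")
print(f"Implied sites per gene:   {sites_per_gene:.0f}")
```

Under these assumptions the total comes out between 2 and 3 million sites, or a bit over 100 per gene. A different genome size or site length shifts the numbers but not the order of magnitude.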

This is absurd. There must be something wrong with the data.

It's not difficult to see the problem. The assays used by ENCODE are designed to detect transcription factor binding sites, places where histones have been modified, and sites that are sensitive to DNase I. These are all indicators of functional regulatory sites but they are also likely to be associated with non-functional sites. For example, transcription factors will bind to thousands of sites in the genome that have nothing to do with regulation [Are most transcription factor binding sites functional?].

It's very likely that spurious transcription factor binding will lead to histone modification and DNase I sensitivity due to the loosening of chromatin. What this means is that these assays don't actually detect regulatory sites or enhancers as ENCODE claims. Instead, they detect putative regulatory sites that have to be confirmed by additional experiments.

The scientific community is gradually becoming more and more skeptical of these over-interpreted genomic experiments.

The latest genomics paper on regulatory sites has just been posted on bioRχiv (Benton et al., 2017). This is a pre-publication archive site. The paper has not been peer-reviewed and accepted by a scientific journal but it's still making a splash on Twitter and the rest of the internet.

Here's the abstract ...
Non-coding gene regulatory loci are essential to transcription in mammalian cells. As a result, a large variety of experimental and computational strategies have been developed to identify cis-regulatory enhancer sequences. However, in practice, most studies consider enhancer candidates identified by a single method alone. Here we assess the robustness of conclusions based on such a paradigm by comparing enhancer sets identified by different strategies. Because the field currently lacks a comprehensive gold standard, our goal was not to identify the best identification strategy, but rather to quantify the consistency of enhancer sets identified by ten representative identification strategies and to assess the robustness of conclusions based on one approach alone. We found significant dissimilarity between enhancer sets in terms of genomic characteristics, evolutionary conservation, and association with functional loci. This substantial disagreement between enhancer sets within the same biological context is sufficient to influence downstream biological interpretations, and to lead to disparate scientific conclusions about enhancer biology and disease mechanisms. Specifically, we find that different enhancer sets in the same context vary significantly in their overlap with GWAS SNPs and eQTL, and that the majority of GWAS SNPs and eQTL overlap enhancers identified by only a single identification strategy. Furthermore, we find limited evidence that enhancer candidates identified by multiple strategies are more likely to have regulatory function than enhancer candidates identified by a single method. The difficulty of consistently identifying and categorizing enhancers presents a major challenge to mapping the genetic architecture of complex disease, and to interpreting variants found in patient genomes. 
To facilitate evaluation of the effects of different annotation approaches on studies' conclusions, we developed a database of enhancer annotations in common biological contexts, creDB, which is designed to integrate into bioinformatics workflows. Our results highlight the inherent complexity of enhancer biology and argue that current approaches have yet to adequately account for enhancer diversity.
The authors looked at several ENCODE databases identifying sites of histone modification and DNase I sensitivity as well as sites that are transcribed. They specifically looked at databases predicting functional enhancers based on these data. What they found was very little correlation between the various databases and predictions of functionality. When they looked at independent assays using the same cell lines they found considerable variation and a surprising lack of correlation.
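To make "very little correlation" concrete, here's a toy sketch of one way to score agreement between two sets of predicted enhancers: the fraction of covered bases that both sets share (a Jaccard index). This is not necessarily the metric used by Benton et al.; the function names and the coordinates are made up for illustration.

```python
def covered_bases(intervals):
    """Return the set of positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def jaccard_overlap(set_a, set_b):
    """Shared bases divided by all bases covered by either enhancer set."""
    bases_a = covered_bases(set_a)
    bases_b = covered_bases(set_b)
    union = bases_a | bases_b
    if not union:
        return 0.0
    return len(bases_a & bases_b) / len(union)


# Hypothetical predictions on one chromosome from two different strategies.
histone_based = [(100, 300), (1000, 1200), (5000, 5200)]
dnase_based = [(250, 450), (3000, 3200), (5100, 5300)]

score = jaccard_overlap(histone_based, dnase_based)
print(f"Jaccard overlap: {score:.2f}")
```

A score near 1 would mean the two methods are flagging the same stretches of DNA. A score near 0 means they mostly disagree about where the putative enhancers are.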

While this lack of correlation does not prove that the sites are non-functional, it does indicate that you shouldn't just assume that these sites identify real functional enhancers (regulatory sites). In other words, skepticism is the appropriate stance.

But that's NOT what the authors conclude. Instead, they assume, without evidence, that every assay identifies real enhancers and what the data shows is that there's an incredible diversity of functional enhancers.
... we believe that ignoring enhancer diversity impedes research progress and replication, since, "what we talk about when we talk about enhancers" include diverse sequence elements across an incompletely understood spectrum, all of which are important for proper gene expression. [my emphasis - LAM]
I find it astonishing that the authors don't even discuss the possibility that they may be looking at spurious sites that have nothing to do with biologically functional regulation. Scientists can find all kinds of ways of rationalizing the data when they are convinced they are observing function (confirmation bias). In this case, the data tells them that many of the sites do not have all of the characteristics of actual regulatory sites. The obvious conclusion, in my opinion, is that the sites are non-functional, just as we suspect from our knowledge of basic biochemistry.

True believers, on the other hand, arrive at a different conclusion. They think this data shows increased complexity and mysterious functional roles that are "incompletely understood."

I hope reviewers of this paper will force the authors to consider spurious binding and non-functional sites. I hope they will force the authors to use "putative enhancers" throughout their paper instead of just "enhancers."

Benton, M.L., Talipineni, S.C., Kostka, D., and Capra, J.A. (2017) Genome-wide Enhancer Maps Differ Significantly in Genomic Distribution, Evolution, and Function. bioRxiv. [doi: 10.1101/176610]

Morris, K.V., and Mattick, J.S. (2014) The rise of regulatory RNA. Nature Reviews Genetics, 15:423-437. [doi: 10.1038/nrg3722]


  1. If mRNAs are so fundamental to epigenetic and developmental processes, shouldn't the array of mRNAs vary in different tissues? Neurons would have a different mix of mRNAs than cardiac muscle cells. And, if the numbers and varieties do not vary in this way, isn't that prima facie evidence these are random, non-functional accidental transcriptions?

    1. Different cell types utilize different sets of transcription factors to execute their gene expression programs. For every transcription factor there exists a set of functional and a set of spurious binding sites to which that factor can bind to facilitate transcription. Because different sets of transcription factors will bind different sets of spurious as well as functional sites, the set of spurious transcripts is, like the set of functional transcripts, expected to vary with cell type.

  2. Meet 'Dark DNA' - The Hidden Genes That May Change How We Think About Evolution

    Is Dark DNA hiding in the so-called junk DNA?

    1. You seem to enjoy tilting at windmills. Why?

    2. You seem to enjoy tilting at windmills. Why?

      I think you got me confused with the ENCODE people and the like...

      I'm just trying to get to the truth just as science should be...