Sunday, February 12, 2017

ENCODE workshop discusses function in 2015

A reader directed me to a 2015 ENCODE workshop with online videos of all the presentations [From Genome Function to Biomedical Insight: ENCODE and Beyond]. The workshop was sponsored by the National Human Genome Research Institute in Bethesda, Md (USA). The purpose of the workshop was ...

  1. Discuss the scientific questions and opportunities for better understanding genome function and applying that knowledge to basic biological questions and disease studies through large-scale genomics studies.
  2. Consider options for future NHGRI projects that would address these questions and opportunities.
The main controversy concerning the human genome is how much of it is junk DNA with no function. Since the purpose of ENCODE is to understand genome function, I expected a lively discussion about how to distinguish between functional elements and spurious nonfunctional elements.

I also expected a debate over the significance of associations between various molecular markers and disease. Are these associations reproducible and relevant? Do the molecular markers have anything to do with the disease?

I looked at most of the videos but I saw nothing to suggest the workshop participants cared one hoot about either of these debates. Perhaps I missed something? If anyone can find such a discussion please alert me.

There was no mention of junk DNA and no mention of the failed publicity hype surrounding publication of the 2012 papers. It was as though that episode never existed. The overwhelming impression you get from looking at the presentations is that all the researchers believe all their data is real and reflects biological function in some way or another.

The planning stage was all about collecting more and more data. Nothing about validating the data they already have. This workshop really needed to invite some of their critics to give presentations. These PIs needed to hear some "alternative truths"!

The closest thing I could find to the thinking of the participants was a slide from a talk by Michael Snyder. I assume it reflects the thinking of ENCODE leaders.

It's true that the number of protein-coding genes hasn't changed very much in the past 50 years or so. If anything, it's gone down a bit so that today we think there are fewer than 20,000 protein-coding genes. ENCODE did very little to change our view of protein-coding genes.

Prior to ENCODE there were dozens and dozens of known genes for functional noncoding RNAs. The number of proven genes in this category has crept up little by little as proven functions are found for some conserved transcripts. Today, it's conceivable there might be as many as 5,000 genes for functional noncoding RNAs. I don't think that's what Michael Snyder meant. I think he meant 100,000 or more genes but I can't be sure. In any case, even the most optimistic estimate, 100,000 genes, would only occupy a few percent of the genome.

The original 2012 ENCODE papers talked about millions of regulatory sequences. What Michael Snyder is saying here is that there are more "potential" regulatory sequences than coding DNA. That would be more than 1.5% of the genome, or 48 million bp. Assuming 48 bp per regulatory site, that's one million regulatory sequences. It's enough for 40 regulatory sites for every known gene in our genome.
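The back-of-envelope arithmetic above can be sketched as follows. This is only a rough check; the genome size, the 1.5% coding fraction, the 48 bp site size, and the ~25,000 gene count are all round numbers taken from the post, not measured values:

```python
# Rough sanity check of the regulatory-site arithmetic in the post.
# All figures are round numbers, not measurements.
genome_bp = 3.2e9          # approximate size of the human genome
coding_fraction = 0.015    # ~1.5% of the genome is protein-coding
site_bp = 48               # assumed size of one regulatory site
genes = 25_000             # generous count of known genes

regulatory_bp = genome_bp * coding_fraction   # "more than coding DNA"
n_sites = regulatory_bp / site_bp             # number of regulatory sites
sites_per_gene = n_sites / genes

print(f"{regulatory_bp / 1e6:.0f} Mb -> {n_sites / 1e6:.1f} million sites "
      f"-> {sites_per_gene:.0f} sites per gene")
# -> 48 Mb -> 1.0 million sites -> 40 sites per gene
```

The numbers are deliberately coarse; the point is only that one million sites and 40 sites per gene follow directly from the stated assumptions.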

That doesn't make a lot of sense to me. Does anyone know of a single example of a gene whose expression is regulated by factors binding to DNA at 40 different sites?

The slide leaves out the most important thing about function; namely how much of the genome is functional. I'd love to know if the view of ENCODE researchers on junk DNA has changed between 2003 and 2015.


31 comments :

  1. Larry,

    I'm not going to pretend to be an expert on the issue of jDNA but I once came across an article claiming that the majority of the human genome is no longer functional because it was only needed during the developmental stage of the organism.

    I've read a little about knockout experiments but they are inconclusive in some mammals.

    ReplyDelete
  2. "Does anyone know of a single example of a gene whose expression is regulated by factors binding to DNA at 40 different sites?"


    Factors? Do Polycomb group (PcG) binding sites count? Technically they aren't transcription factors, but they are important and they need places to park.

    The number of tandem repeats may indicate the number of bound regulatory factors in a few cases. A good example is the Dystrophin gene. The tandem D4Z4 repeat has a separate Polycomb group on every repeat! In the case of muscular dystrophy, when the tandem repeat number is below 11, it is associated with disease. So there are at least 11 binding sites involved in regulation, probably more since healthy individuals have up to 100 repeats, so with a Polycomb Group binding to each of the repeats, that would easily be 100 binding sites for that gene alone.

    http://www.cell.com/cell/abstract/S0092-8674(12)00463-1

    One ENCODE meeting (open to the public) that I was a part of had a researcher protesting that tandem repeats weren't actively annotated and tracked by ENCODE! I guess even with all the data that ENCODE gathers, there's still stuff that slips through the cracks.


    Beyond that, binding doesn't necessarily have to happen at the same time, but with different transcription factories and topologically associated domains for each cell type (213 canonical, maybe thousands with a looser definition), it is clearly possible that 48 sites could be involved. In one cell type a gene may be regulated by sites in one chromatin conformation and one set of chromosomes, and in another cell type regulated by a different chromatin conformation and different chromosomes. That's driving John Rinn's Cat's cradle hypothesis, and he was motivated by his own discovery of the FIRRE lincRNA doing something that would support a Cat's cradle. That also, no doubt, motivated the creation of ENCODE's sister project, the 4D Nucleome.


    So a lot of the DNA acts as a parking lot (binding site) for molecular machines servicing a gene or transcription factory. These machines do a lot of histone modification and DNA methylation.


    Since many genes are processed in transcription factories, and the factories are different for the 213 canonical cell types (and maybe thousands of cell types depending on one's definition of cell type), it's possible there will be many different bindings depending on cellular context. Robert Tjian (who publishes with the sister project of ENCODE known as 4D Nucleome) mentioned it in his video on gene transcription. He called it combinatorial gene regulation.


    "A reader directed me to a 2015 ENCODE workshop "

    You mean me? :-)

    ReplyDelete
    Replies
    1. If I follow correctly, those seem to be examples that would answer Larry's question. But as always we are faced with a known reality that the bulk of the putative junk DNA is derived from what we know, by their origin, to be vagabond DNA elements like transposons, viral DNA inserts, and tandem repeats. Any finding of regulatory DNA or a functioning stretch of expressed RNA among this stuff is more likely an instance of secondary recruitment. By quantity, these findings barely put a dent into the large amount of what still seems to be junk DNA that is tolerated in eukaryote genomes. If this view is wrong, then we will have to rethink what we understand about transposons and viruses and slippage of DNA during replication.

      I get the impression that the 'ENCODians' are like a bunch of prospectors on the side of an enormous mountain of slag, into which they dig to find an occasional small lump of gold. In this analogy, when the ENCODians find something, they greatly inflate the significance of their occasional lump of gold by declaring that it shows the whole mountain could be gold. Then comes a gushy press release saying as much: "Prospectors discover that an enormous mountain of slag could be mostly gold!!!"
      To paraphrase a well known saying: the ENCODians have been trying to fool us. But before they did that they took great pains to fool themselves.

      Delete
    2. Sal Cordova tried to answer my question. I asked whether anyone knew of a gene with at least 40 proven regulatory sites. (Recall that this is supposed to be the AVERAGE for human genes.)

      Sal rambled on a bit but the bottom line is that he doesn't know of an example.

      Does anyone else?

      Delete
    3. Sal asked,

      You mean me? :-)

      Yes, I meant you. Do you give me permission to edit my post by saying it was Sal Cordova who gave me the link?

      Delete
    4. "Sal rambled on a bit but the bottom line is that he doesn't know of an example."

      Don't those 100 binding sites for the Polycomb groups on the Dystrophin gene qualify as important for regulation? That gives us up to 100 binding sites for 100 separate Polycomb groups. So why does my example not qualify?

      Why wouldn't those 100 sites be considered as participating in regulation? Obviously if enough of those repetitive sites are missing, it results in a disease state.

      Or is your objection that it isn't proven?

      I'm not trying to be combative on this point, but if a lot of gene regulation involves histone readers, writers, and erasers binding to histones on the gene or wherever, then doesn't this qualify as some sort of relevant regulatory binding site?

      I know you used the word "factor", but don't polycomb groups count as some sort of regulatory complex?

      Thanks anyway for reading my comment.

      Delete
    5. It seems to me most of those PcG binding sites are for ensuring effective silencing of so large a gene, not because they're really involved in some baroquely complex regulatory function of expression levels.

      It's a huge gene, so it takes a lot of material to prevent spurious transcription of so much DNA. I could easily imagine noisy transcription of such a large gene could interfere with normal cellular processes. As far as I can gather, that is indeed what happens if mutations occur in the binding sites: the result is disease. So the gene needs to be transcribed properly and in its entirety, so you need lots of binding spots for silencers to keep it inactive when it isn't needed.

      It becomes a matter of semantics then, because I would agree silencing is a form of regulation. So technically I could agree dystrophin has 100 regulatory binding sites.

      Delete
  3. A million regulatory sequences isn't at all unreasonable based on what is known about enhancers. The relatively few that have been experimentally characterized consist of multiple short protein binding sites arranged in clusters. There seems to be some redundancy in function in that an enhancer with say six binding sites for protein X might retain its function as long as at least one of the six sites was intact.

    Moreover a single binding site isn't going to be 48 bp in length (that's the resolution of the assay to detect the presence of a binding site). Binding sites themselves are more like maybe 10 bp in length and some of that sequence is degenerate.
    In vitro binding requires maybe 20 bp but that's in order for the DNA to assume a conformation that approximates its conformation in vivo. Of those 20 bp only 2-4 at most are going to make base-specific contacts with the protein, with additional contacts being made with the phosphates of the backbone. Because the sites themselves are short and DNA conformation isn't nearly as dependent on sequence as protein conformation is, regulatory sequences aren't nearly as sensitive to mutation (transitions, transversions, insertions, deletions, inversions, etc.) as coding sequence is. A million transcription factor binding sites might occupy 1% or more of the genome but the amount of literal sequence (as opposed to indifferent sequence) required to make them biologically functional is only going to be a fraction of that.

    ReplyDelete
    Replies
    1. One million regulatory sequences means an average of 40 per gene. I don't know of a single well-studied gene that has 40 regulatory sites, do you? It is not reasonable.

      There are about 100 genes for ribosomal proteins. Why would they need 40 regulatory sites? Why would most of the thousands of housekeeping genes need so many regulatory sites?

      Delete
    2. It seems to me that if a gene has 40 regulatory sites, the difference between, say, number 38 having a regulatory function and its having no function at all would be too small to measure.

      Delete
    3. Larry,

      An average of 40 per gene is just that, an average. Genes showing complex patterns of expression (spatially and temporally restricted) are likely going to have more regulatory sites and house-keeping genes are likely going to have fewer.

      An example of a gene with at least 40 regulatory sites is SHH (sonic hedgehog). It's expressed in a number of different sites during development and has a number of different enhancers (each consisting of multiple transcription factor binding sites and sequences whose importance can be demonstrated via mutation even if a specific binding factor hasn't been identified).

      The SHH enhancer I'm most familiar with is ~ 800 -1000bp in length and is situated ~ 1Mb away from the transcriptional start site for the SHH gene. (It's ~ 1Mb in humans more like 800kb in mice. In both cases it is located within the intron for another gene.) It is strictly required for SHH expression in developing limb buds. In mice if you delete the region you end up with a mouse in which the gene is expressed normally everywhere but the limb in which expression is abolished. (Expression of that other gene in whose intron the sequence is located isn't affected.) Mice with the deletion are born with severely truncated limbs. Within the same region in humans (and mice, chickens, cats and likely other species as well) there are a number of much smaller mutations that cause polydactyly. The gene is normally expressed in a very small region of a developing limb bud. These small mutations (sometimes as small as a single base change) result in the gene being expressed in regions of the limb bud where it isn't normally expressed and that results in extra fingers and/or toes.

      The small mutations have been identified pretty much by chance because extra fingers and/or toes are pretty obvious and they don't affect viability.

      Delete
    4. An average of 40 per gene is just that, an average.

      Exactly. For every ten genes that have only 10 regulatory sites there have to be ten genes with 70 binding sites.

      An example of a gene with at least 40 regulatory sites is SHH (sonic hedgehog)

      Could you give me some references?

      Delete
  4. I should add that suggesting that house-keeping genes are likely to have fewer regulatory sites is really just my bias. You could argue that house-keeping genes, because of their importance, might have more because they'll have more redundancy. But most of the studies that I'm aware of focus on genes with complex patterns of expression so those are the enhancers that have also been most studied.

    ReplyDelete
    Replies
    1. Can you point me to examples of those genes with complex patterns of expression where the biological function of 40 or more transcription factor binding sites have been demonstrated?

      Please be careful about terminology. An "enhancer" is a site where there's solid evidence of biological function. In the absence of evidence it is not an enhancer. It is a "putative enhancer" or just a binding site.

      The scientific literature is full of claims about enhancers where the only "evidence" is the presence of a binding site. That's called begging the question.

      Delete
  5. Larry writes:
    "I saw nothing to suggest the workshop participants cared one hoot about either of these debates."

    Agree, and I said as much in the thread that got all this started:

    "I think they don't really care how much of the genome is functional."

    http://sandwalk.blogspot.com/2017/02/what-did-encode-researchers-say-on.html?showComment=1486846378044#c1661586536983818740

    "The planning stage was all about collecting more and more data."

    Is there a problem with that? Even supposing these regions are not causally functional, they could be still diagnostic (symptomatic), and that is pretty important too. So the data collection will continue to be funded. The medical community wants the data, plain and simple.

    What drives ENCODE data collection isn't "the genome is 80% functional", it's data collection. Even supposing the genome is only 10% functional, unless we know exactly in advance which parts are and are not functional, we have to keep doing data collection. Look at the few lncRNAs we found to be functional (like XIST). Unless we had looked and done the data collection, we probably wouldn't know it had function. We still have to do this even if LOLAT lncRNAs aren't functional.

    "It's enough for 40 regulatory sites for every known gene in our genome."

    Because histones are potential regulatory targets and there is one histone set for about every 200 bp of DNA, that's about 825 regulatory regions of DNA per protein-coding gene.
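    The nucleosome arithmetic behind that figure can be reproduced as follows. This is only a sketch using round numbers: a ~3.3 Gb genome and ~20,000 protein-coding genes are assumed values, not figures stated in the comment; only the 200 bp spacing comes from the comment itself:

```python
# Sketch of the histone arithmetic: one nucleosome (histone set) per
# ~200 bp of DNA, divided among ~20,000 protein-coding genes.
# Round numbers only; assumed, not measured.
genome_bp = 3.3e9
bp_per_nucleosome = 200
protein_coding_genes = 20_000

nucleosomes = genome_bp / bp_per_nucleosome
per_gene = nucleosomes / protein_coding_genes

print(f"{nucleosomes / 1e6:.1f} million nucleosomes, "
      f"~{per_gene:.0f} per protein-coding gene")
# -> 16.5 million nucleosomes, ~825 per protein-coding gene
```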

    X-inactivation is a good example of putting regulatory marks on a large fraction of the female inactive X-chromosome for dosage compensation. And we know it is highly targeted because X-inactivation leaves some genes alone (like those genes with Y chromosome homologs).

    70% of the genome's histones have H3K27me3 PRC2 marks on them, many in repetitive regions. That has regulatory significance. It looks more and more like repetitive regions have a lot of regulatory robustness.

    Why put multiple regulatory marks on every histone on each silenced gene on the X-chromosome, when in principle only one histone might be needed to be marked (like at the promoter region)?

    One reason is that the intronic regions of one gene are used to regulate and transcribe other genes (possibly even on other chromosomes). How the heck can this be coordinated in cell-type-specific manners unless some sort of chromatin modification is taking place on these non-coding regions, like histone modifications, or RNA binding to DNA to work with it (like FIRRE, XIST, HOTAIR)?

    Ok, so assume the default is non-function. Even if ENCODE adopted that creed, it's not going to change the way they do business because unless we know in advance what is and is not junk, collecting this sort of data is basic research.

    ReplyDelete
    Replies
    1. Sal Cordova responds to a comment I made.

      "The planning stage was all about collecting more and more data."

      Is there a problem with that?

      The PI begins the meeting, "I'd like to call this group meeting to order."

      "As you know," she says, "we are interested in how much of the genome is functional. Most of you have collected a huge amount of data but we still don't know the answer to the question. Right now we can't tell whether the features you have identified have anything to do with biological function or whether they are just noise."

      Looking around the room, the PI asks, "What should we do?"

      "Let's just collect more data," says one of the post-docs.

      "Excellent idea!" exclaims the PI with a relieved look on her face. "I'll write the grant."

      Delete
    2. Larry writes:
      "The overwhelming impression you get from looking at the presentation is that all the researchers believe all their data is real and reflects biological function in some way or another. "

      Right on! You called the attitude the way it really is. For once we agree.

      I asked an NIH researcher last year in passing about LINE-1s. He said he didn't know, but "it's there for a reason." That's just the natural sentiment you'll get from these guys. Where it originates, we can only speculate, but it's not like that attitude is rare; it's pretty commonplace at the NIH and probably for most medical researchers.

      Delete
    3. It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.

      Delete
    4. Mikkel,

      “It's there for a reason. Which is just another way of saying that something caused it to be there. Which would be true to say of the actual trash in a landfill.”

      To square up your analogy, you’d need to notice a lot of amusement parks in the landfill interacting with the garbage. But how come the allergy to “something caused it”? Why is that offensive?

      I always enjoy reading what you have to say. The guard is always on duty, and I reckon that’s a good thing. Nobody wants to have anyone slip them a Mickey. But, how do you know that someone already didn’t?

      Delete
    5. Why do you think anyone is taking offense? That doesn't even make sense.

      I'm just here to help you avoid making leaps the data does not bear out. "It's there for a reason" doesn't get you to where you so desperately want to go.

      Delete
  6. Of the 400 members of the original ENCODE consortium, several were in my own department. One, John Stamatoyannopoulos, is convinced that there is little or no junk DNA. Several others were less convinced. One told me that from now on he was going to make sure to point out that there really was junk DNA when he presents his ENCODE work. Another, Max Libbrecht, was so clear and public on this that his dissent from the ENCODE announcement was featured in Ryan Gregory's Genomicron blog (here).

    But an alarming number of molecular biologists do think that almost all of the genome is not junk DNA. They have no answer for Graur's points -- they are not molecular evolutionists and do not understand the mutation load objection, the genome size variation objection, the lack-of-conservation objection, or the transposable element objection. They just assume that the genome is a finely-tuned machine, all of whose features are "there for a reason". I am embarrassed for them (they lack the requisite embarrassment).

    One of the underlying motives for their view is probably this: think of all the grants we can apply for to work out what these parts of the genome do!

    But what do I know? I'm only a human, and therefore far inferior to our more highly-evolved relatives, the onion and the lungfish.

    ReplyDelete
    Replies
    1. Many of those molecular biologists graduated from university in the 1990s or even later. They did not receive a proper education in evolution or in biochemistry. A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things).

      Their teachers were scientists of our generation. Where did we go wrong?

      My colleagues who teach undergraduates are perpetuating some of the same mistakes my generation made. This is hardly surprising. We are graduating another generation of students who don't understand the fundamental concepts of our disciplines.

      How can we fix this?

      We've been discussing genomes and junk DNA in my class on molecular evolution. The students are about to graduate in just a few months but it's the first time they've been told there's even a controversy. There's something seriously wrong here. Isn't critical thinking supposed to be our goal?

      Delete
    2. Many people in genomics, including a number of leading figures, aren't even molecular biologists -- they come from physics, statistics, engineering or computer science backgrounds, as the field needs serious computational expertise for method development, and method development is one thing it heavily revolves around in general.

      They mostly haven't even passed through those courses. I don't know exactly how they got their understanding of the subject, and it probably varies a lot from person to person, but I would venture a wild guess that, given that they have mostly learned it on the fly while doing research, the kind of hype you get from the likes of Nature and Science has had a major influence on many of them.

      Delete
    3. I think that somebody with a background in physics or statistics might be better prepared for this subject than biologists. When I got to college I enrolled in physics and switched to paleontology after a while. And when I got to the second half of my studies and had to pick a secondary subject I picked mathematics, with a focus on probability theory, mainly because what drove me to paleontology in the first place was reading David Raup's "Nemesis Affair" as a kid and figuring that, since my goal was to do research by performing statistical analyses on fossil invertebrate data, I should try to take in as much of the maths as possible. The main hurdle for a lot of biology students is that population genetics requires maths. When I got around to actually looking at population genetics, I had enough of the mathematical background to actually read it. When I read Kimura's paper on the probability of fixation I already knew what the Kolmogorov equations were and what type of problems you could solve with them; I knew that you could approximate discrete stochastic processes with a Wiener process and how to get the scaling needed. I could read the paper and it made sense. Without that prior knowledge it must seem like magic. Statisticians and physicists should have that background. The percentage of in-text citations that reference Fisher, Wright and Haldane is higher in statistics textbooks than in biology textbooks (the history of statistics basically has three parts: the early days, where mathematicians would do a bit of statistics on the side and also developed the relevant combinatorics; the modern synthesis; and finally the bit that started with the Kolmogorov axioms, where things were tidied up mathematically). And physicists can't really escape statistical mechanics either.
      I think we put too little mathematics into biology education. I had to seek it out (my choice of mathematics as a secondary was unprecedented: no one had ever chosen it in the 60 years or so in which you had to pick a secondary) and with recent changes in the programs it wouldn't even be possible to go a route similar to mine in an organized fashion. A current student wouldn't even be able to enroll in the courses I did, much less receive credit.

      Delete
    4. I fully agree with your points about math and biology.

      But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

      They're coming into the field of genomics, not the field of evolutionary theory. So their job is to build computational tools for assembling genomes, processing various types of sequencing data, building statistical models for medically relevant genetics, etc. And they get influenced by whatever hype is being pushed at the moment. Also keep in mind that a lot of this work is being done in medical schools and other biomedically oriented institutions that often don't even have evolutionary biologists in the ranks of their faculty.

      Obviously this doesn't apply to every single such individual, very far from it, but on average there probably is such a trend.

      Delete
    5. Dr. Moran writes: "A proper education would have taught them basic population genetics and basic properties of DNA binding proteins (among other things)."

      Regarding DNA binding proteins, does this mean DNA binding proteins have at least some random affinity for random sections of DNA? That is, when we throw in the formaldehyde or some other agent to "freeze" the state of what is bound to the DNA, will we be getting random binding that really has no utility for the cell?

      Thanks in advance.

      Btw, this is a pretty good discussion so far. Thanks for hosting it.

      I was part of a separate 3-day ENCODE meeting in 2015 open somewhat to the public. A molecular biologist there was lamenting ENCODE didn't provide more facility for tracking "function" of repetitive elements. Over lunch he told me he was studying a particular gene for over 20 years, and his lab work gave him good evidence the repetitive elements were of regulatory significance. He was in the middle of applying for a grant to investigate it. He struck me as sincere and seemed typical of a lot of the lab researchers there. I think the sentiment you lament about is going to be hard to undo, whatever its causes.

      Delete
    6. But the problem is that the people coming into biology from a math-heavy background are not reading Kimura. They might be much better positioned to understand his work, but they're just not reading it.

      They're coming into the field of genomics, not the field of evolutionary theory.


      Can anybody interpret Georgi's and obviously many others' anxiety? Because that's what ENCODE did to most if not all evolutionists.

      I'm really glad that Georgi told us "where the bodies are buried" because I suspected there were some...

      Here it is:

      "They're (ENCODE scientists who are not evolutionists) coming into the field of genomics, not the field of evolutionary theory."
      Interpretation: If scientists interpret data that is not aligned with current and accepted evolutionary theory (whichever that is now, who knows?) they should do what, Georgi? Disregard it? For the sake of what? Your ambitions, or the better good of Darwinian bullies who don't like what ENCODE has found and possibly will find in the future?


      Delete
    7. A good discussion. In computational molecular biology courses here, we try to educate students not only in algorithmics and computation, but also in statistics and evolutionary biology. But most genomicists training even in our department (Genome Sciences) don't go through those courses, only those interested in CMB or "bioinformatics". And elsewhere, even the "bioinformatics" courses concentrate on teaching Python and some algorithmics. Bioinformatics textbooks show the same biases, with much discussion of BLAST, sequence assembly, and alignment. But typically the phylogenies material is towards the back of the book and only 10-15 pages long. And they may have no population genetics material at all.

      It may take some sort of humiliation of molecular biologists and genomicists, such as people seeing them waste a lot of money and then fail to find any function in much of the genomes. Then they might start asking how they went wrong.

      Delete
    8. Joe,
      Blah, blah, blah...

      Let's face it: what are you going to do if most of the human genome is proven to be functional?

      I will answer that for you:

      You are not going to throw away your life's work that is the opposite of the findings for a couple of "morons" who have the evidence to prove you wrong...

      Delete